﻿<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>endjin.com</title><atom:link href="https://endjin.com/rss.xml" rel="self" type="application/rss+xml" />
    <link>https://endjin.com</link>
    <description>endjin is a UK-based Technology Consultancy specialising in Data, Analytics &amp; AI, and Cloud Native App Dev on Microsoft Fabric, Databricks &amp; Azure. We help small teams achieve big things.</description>
    <copyright>Endjin Limited</copyright>
    <docs>http://www.rssboard.org/rss-specification</docs>
    <generator>Vellum Static Site Generator</generator>
    <image>
      <link>https://endjin.com</link>
      <title>endjin.com</title>
      <url>https://res.cloudinary.com/endjin/image/upload/v1775912583/assets/images/logo/endjin-logo-square.png</url>
    </image>
    <language>en</language>
    <lastBuildDate>Tue, 05 May 2026 00:00:00 GMT</lastBuildDate>
    <pubDate>Tue, 05 May 2026 00:00:00 GMT</pubDate>
    <ttl>60</ttl>
    <item>
      <title>SQLBits 2026: A Conference Recap</title>
      <description>SQLBits is one of the largest data platform conferences in Europe. Here's a recap of my experience at SQLBits 2026, held at the ICC Wales.</description>
      <link>https://endjin.com/blog/sqlbits-2026-a-conference-recap</link>
      <guid isPermaLink="true">https://endjin.com/blog/sqlbits-2026-a-conference-recap</guid>
      <pubDate>Tue, 05 May 2026 00:00:00 GMT</pubDate>
      <category>Microsoft Fabric</category>
      <category>SQLBits</category>
      <category>Data</category>
      <category>Power BI</category>
      <category>DAX</category>
      <category>OneLake</category>
      <category>SQL Server</category>
      <category>Data Factory</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/05/sqlbits-2026-a-conference-recap.png" />
      <dc:creator>Carmel Eve</dc:creator>
      <content:encoded><![CDATA[<p>I recently attended <a href="https://sqlbits.com/">SQLBits 2026</a> at the ICC Wales - one of the largest data platform conferences in Europe. In the run-up to the conference, I was part of the SQLBits <a href="https://sqlbits.com/about/inclusivity/">DEI (Diversity, Equity and Inclusion) panel</a>, where we worked on various initiatives to make the event more inclusive - from the "One SQLBits, Many Nations" map in the community hub, to non-sensory food options, all-day snacks, and the "Find Your People" networking sessions. It was brilliant to see a lot of this come together on the day (and especially lovely to see so many kids and parents at the <a href="https://codeclub.org/">Code Club</a> sessions on the Saturday!).</p>
<p>Across four days, I went to sessions on everything from the SQL Server roadmap and DAX deep dives, to ingesting SharePoint data into Fabric, home automation, inclusive team building, and avoiding burnout. It was a packed few days, and I came away with a lot of new knowledge and plenty to think about.</p>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/05/sqlbits-poster.jpeg" alt="SQLBits Poster" srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2026/05/sqlbits-poster.jpeg 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2026/05/sqlbits-poster.jpeg 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2026/05/sqlbits-poster.jpeg 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2026/05/sqlbits-poster.jpeg 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<p>This post gives an overview of the sessions I attended. I'll also be writing some deeper technical posts on the topics that I found most interesting!</p>
<h2 id="day-1">Day 1</h2>
<h3 id="keynote-the-sql-roadmap">Keynote - The SQL Roadmap</h3>
<p><em>Priya Sathy, Shiva Gurumurthy, Bob Ward, Anna Hoffman</em></p>
<p>The conference kicked off with a keynote focused on the SQL Server roadmap. There were a lot of announcements about migrating to Azure SQL Managed Instance, including cost estimation tools, the ability to check the effect of changes via lineage, and the fact that you can now manage memory and performance independently. They also touched on VMs - updating to the new series can provide huge throughput increases for analytics, even with smaller VMs.</p>
<p>Anna Hoffman did a couple of great demos. The first used GitHub Copilot to analyse why a stored procedure was running slowly. The second showcased the notebook experience in the mssql VS Code extension, where you can create reports and export them for sharing.</p>
<p>There was also a section on SQL Database in Fabric, highlighting how it's natively integrated into OneLake, autonomous, secure, and optimised for AI. Anna demonstrated using the mssql extension in VS Code whilst connected to Fabric to make changes to a database, and then showing how those changes were reflected in the Fabric UI and could be committed to Git. It's clear that the developer experience for SQL in Fabric is getting a lot of attention.</p>
<p>Finally, there was a section on monitoring in SQL and database agents (all still in preview). This included estate-level triage, asking Copilot for insights, and agents putting alerts straight into Teams and tagging relevant people - a promising step towards more proactive database management.</p>
<h3 id="variable-libraries-in-fabric">Variable Libraries in Fabric</h3>
<p><em>Kevin Arnold</em></p>
<p>Next up was a session on Microsoft Fabric variable libraries - originally scheduled as a 20-minute slot, but Kevin was allowed to continue into the break and ended up speaking for around 45 minutes (which was a good thing, as there was plenty to cover!).</p>
<p>Variable libraries are the mechanism intended for managing environment-specific configuration in Fabric. The session covered what they are, how they work across different Fabric item types, and some best practices for using them safely.</p>
<p>Watch out for a new post on best practices when using variable libraries!</p>
<h3 id="microsoft-fabric-and-the-mess-of-sharepoint">Microsoft Fabric and the Mess of SharePoint</h3>
<p><em>Laura Graham-Brown</em></p>
<p>Let's be honest - there's no escaping SharePoint. This session from Laura Graham-Brown explored the various options for ingesting SharePoint data into Microsoft Fabric. She was a great presenter and covered a lot of ground, walking through the practical trade-offs of each approach.</p>
<p>Again, keep an eye out for an upcoming post which will go through this in more detail.</p>
<h3 id="conversational-analytics-at-lloyds-banking-group">Conversational Analytics at Lloyd's Banking Group</h3>
<p><em>Andrew Herman, Sean Hughes</em></p>
<p>This was an interesting session about Lloyds Banking Group's journey from traditional reporting to "GenBI" (Generative BI). The scale involved was impressive - over 100,000 reports, 3 million interactions per month, 27,000 workspaces, and 65,000 active users.</p>
<p>As part of their journey, they did a PoC to prove whether switching from traditional reports to conversational AI would help with the classic challenges of slow insights, inefficiency, and data silos.</p>
<p>What I found most interesting was less the AI angle itself, and more the broader transformation approach they described: proving value early, measuring impact quantitatively, winning senior sponsorship, building capability across the organisation, establishing a single source of truth for key datasets, and baking in CI/CD and governance from the start.</p>
<p>These are solid data transformation principles regardless of whether AI is involved - though the AI angle certainly helps with getting that initial buy-in and excitement!</p>
<h3 id="fabric-admin-panel">Fabric Admin Panel</h3>
<p>I attended a Fabric admin panel session hoping for discussion around workspace management and change management best practices. It ended up being more of a Q&amp;A driven by the audience, which was mostly made up of Fabric administrators.</p>
<p>A few interesting things came up:</p>
<ul>
<li>The Fabric team are considering workspace-level tags (high/medium/low priority), where jobs can then be prioritised accordingly.</li>
<li>They also recommended building your own Fabric capacity metrics app - which, reading between the lines, suggests the built-in monitoring story still has room for improvement.</li>
<li>They heavily implied that per-operation metrics are coming, which would be a welcome addition!</li>
</ul>
<p>There was also an interesting discussion about governance versus usability. The general advice was to lean towards openness and invest in education, with the reasoning being that overly restrictive policies tend to backfire - users will find creative workarounds. (The example given was users screenshotting Power BI reports and extracting data from the screenshots after exports were disabled!)</p>
<h3 id="shine-bright-like-a-star-without-burning-out">Shine Bright Like a Star, Without Burning Out!</h3>
<p><em>Gloria Georgieva Clare</em></p>
<p>As I always try to at conferences, I made sure to attend at least one less technical talk. Gloria Georgieva Clare gave an open and personable session on avoiding burnout, and I'm really glad I went.</p>
<p>She talked about the concept of having four "hobs" - health, social, work, and hobbies - and how you can't have all four firing at full power at any one time. She recommended a few books: <em>The Burnout Society</em>, <em>Fair Play</em> (about unseen, unpaid domestic labour), and <em>Solve Your Stress Cycle</em>.</p>
<p>One of the things that resonated with me most was the idea of defining personal "drivers and guardrails" - concrete commitments like "I want to meet a friend after work once a week" or "I shouldn't be eating lunch at my desk". She also made the important point that just because a stressful situation has ended doesn't mean its effects instantly go away.</p>
<p>I think reflecting on these things semi-regularly is really valuable, and a conference is a good prompt for doing it.</p>
<h2 id="day-2-optimising-dax-queries-full-day-workshop">Day 2 - Optimising DAX Queries (Full-Day Workshop)</h2>
<p><em>(<a href="https://www.sqlbi.com/author/alberto-ferrari/">Alberto Ferrari</a> - <a href="https://www.sqlbi.com/">SQLBI</a>)</em></p>
<p>On Day 2 I attended a full-day training workshop with <a href="https://www.sqlbi.com/author/alberto-ferrari/">Alberto Ferrari</a> on optimising DAX queries. I learnt a huge amount about how the underlying engines that power Power BI work, and how the in-memory database (VertiPaq) is optimised. We went through real examples of slow queries and why they were slow - often for reasons I would never have known about beforehand.</p>
<p>The workshop covered everything from how VertiPaq stores data using column-oriented storage and various encoding techniques, through to how the formula engine and storage engine work together when executing DAX queries, and practical techniques for identifying and fixing performance bottlenecks.</p>
<p>Watch out for an upcoming series, which will run through this in much more detail.</p>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/05/power-bi-sqlbits.jpeg" alt="Power BI logo printed on coffee foam" srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2026/05/power-bi-sqlbits.jpeg 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2026/05/power-bi-sqlbits.jpeg 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2026/05/power-bi-sqlbits.jpeg 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2026/05/power-bi-sqlbits.jpeg 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<h2 id="day-3">Day 3</h2>
<p>Day 3 was a bit more relaxed. One of the nice things about SQLBits is that the conference social events (in this case, a pub quiz the night before) are actually a good way to meet people. I spent the morning chatting to various people I'd met, which is probably as valuable as any session.</p>
<h3 id="what-an-automaton-nerd-learnt-by-automating-their-home">What an Automaton Nerd Learnt by Automating Their Home</h3>
<p><em>Rob Sewell</em></p>
<p>This was a fun one. Rob Sewell talked about how he'd automated his house using <a href="https://www.home-assistant.io/">Home Assistant</a> - an open-source home automation platform. He had hundreds of devices all connected via his local network, which was pretty impressive.</p>
<p>What I liked most about this talk was his point that even when you're doing something for fun at home, proper engineering practices and requirements gathering still matter. He had some great examples of what happens when you skip them - including a security setup designed to alert them if someone was in the garden after dark (via a foghorn noise and all the lights turning on), which worked perfectly until they realised they'd forgotten to account for the cat.</p>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/05/rob-sewell-sqlbits.jpeg" alt="Rob Sewell presenting his home networking diagram" srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2026/05/rob-sewell-sqlbits.jpeg 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2026/05/rob-sewell-sqlbits.jpeg 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2026/05/rob-sewell-sqlbits.jpeg 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2026/05/rob-sewell-sqlbits.jpeg 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<h3 id="people-first-building-strong-inclusive-data-teams-that-actually-deliver">People first: building strong, inclusive data teams that actually deliver</h3>
<p><em>Hollie Whittles</em></p>
<p>This was a 20-minute session, and while it was fairly high-level, there were some points that stuck with me.</p>
<p>Hollie Whittles shared statistics showing that poor communication is one of the most common causes of project failure, and talked about steps they'd taken to improve it within their team - including anonymous feedback mechanisms and surveys to better understand working styles.</p>
<p>There was a section on neurodiversity that I found particularly interesting. An estimated 30% of people in tech are neurodiverse, but only 16% say there's ever been a conversation about it in their workplace. She highlighted some common workplace practices that can unintentionally disadvantage neurodiverse team members:</p>
<ul>
<li><strong>Meetings without agendas</strong> - people who need more prep time are disadvantaged before the meeting has even begun.</li>
<li><strong>Only giving feedback in group settings</strong> - many people find public criticism destabilising and need privacy to process constructively.</li>
<li><strong>"Open door" policies</strong> - these only work for people who are already comfortable stepping through the door.</li>
</ul>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/05/invisible-friction.jpeg" alt="Slide: examples of workplace practices that can disadvantage neurodiverse colleagues" srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2026/05/invisible-friction.jpeg 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2026/05/invisible-friction.jpeg 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2026/05/invisible-friction.jpeg 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2026/05/invisible-friction.jpeg 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<p>She also talked about recognising how different team members respond when pressure rises - some withdraw and go quiet, some escalate everything to urgent, and some people-please and agree to everything until they burn out. Being able to recognise these patterns in your team is a useful skill.</p>
<h3 id="get-creative-with-power-bi-making-core-visuals-shine">Get Creative with Power BI - Making Core Visuals Shine</h3>
<p><em>Valerie Junk</em></p>
<p>This session was about visualisation and report design in Power BI. A lot of it was fairly straightforward, but there were a few practical tips I picked up.</p>
<p>On tables specifically: use data bars (but not for every column), only highlight what's really important rather than colour-coding everything (just the top and bottom 3, perhaps - I'm definitely guilty of over-colouring), and grey out less important information. Though it's worth noting that heavy conditional formatting can be terrible for accessibility, so having an option to toggle it off is a good idea.</p>
<p>On that note, she showed a neat technique for toggling conditional formatting on and off using a button slicer, without resorting to bookmarks (which, in my experience, can be a bit of a nightmare). The approach is:</p>
<ol>
<li>Create a table with two values: "Formatting" and "No Formatting"</li>
<li>Add a button slicer connected to that table</li>
<li>Update your conditional formatting rules to check <code>SELECTEDVALUE</code> of that column</li>
</ol>
<p>Simple, but effective - and avoids the fragility that comes with bookmark-based approaches.</p>
<h2 id="day-4">Day 4</h2>
<p>Day 4 was the final day, and featured two talks from my colleague <a href="https://endjin.com/who-we-are/our-people/barry-smart/">Barry Smart</a>.</p>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/05/barry-sqlbits.jpeg" alt="Barry Smart presenting at SQLBits" srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2026/05/barry-sqlbits.jpeg 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2026/05/barry-sqlbits.jpeg 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2026/05/barry-sqlbits.jpeg 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2026/05/barry-sqlbits.jpeg 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<h3 id="spark-unplugged-how-in-process-analytics-is-making-distributed-computing-an-expensive-investment">Spark Unplugged: How In-Process Analytics Is Making Distributed Computing An Expensive Investment</h3>
<p>(<a href="https://endjin.com/who-we-are/our-people/barry-smart/">Barry Smart</a> - endjin)</p>
<p>Barry's first talk challenged the assumption that distributed computing (Spark, in particular) is always the right tool for analytical workloads. With in-process analytics engines like DuckDB and Polars becoming increasingly capable, there's a growing argument that for many workloads, you can get comparable (or better) performance without the overhead and cost of a distributed compute cluster. If you're interested in this topic, keep an eye out for a follow-up post!</p>
<h3 id="no-compromise-data-apps-why-streamlit-is-the-missing-piece-in-your-analytics-stack">No-Compromise Data Apps: Why Streamlit is the Missing Piece in Your Analytics Stack</h3>
<p>(<a href="https://endjin.com/who-we-are/our-people/barry-smart/">Barry Smart</a> - endjin)</p>
<p>Barry's second talk focused on <a href="https://streamlit.io/">Streamlit</a> - a Python framework for building interactive data applications. The session made the case that Streamlit fills an important gap in the analytics stack: the space between a Jupyter notebook (great for exploration, not great for sharing) and a full web application (powerful, but heavy to build). If you're a data team looking to get interactive tools into the hands of business users without a full frontend development effort, it's well worth a look!</p>
<h2 id="overall">Overall</h2>
<p>SQLBits is a well-run conference with a good mix of deep technical content, practical sessions, and softer topics around wellbeing, inclusion, and career development. I came away with a lot of things to think about and try out - which is about all you can ask from a conference, really.</p>
<p>If you're working in the data space - whether that's SQL Server, Fabric, Power BI, or anything in between - it's well worth considering. The 2026 Cartoon theme was a nice touch too!
<img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/05/me-sqlbits.jpeg" alt="Carmel Eve with her SQLBits lanyard" srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2026/05/me-sqlbits.jpeg 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2026/05/me-sqlbits.jpeg 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2026/05/me-sqlbits.jpeg 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2026/05/me-sqlbits.jpeg 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>]]></content:encoded>
    </item>
    <item>
      <title>Multi-layer Caching with the Decorator Pattern</title>
      <description>Databricks SQL cold starts kill web API performance. Fix it with two-layer caching: Azure Blob Storage &amp; IMemoryCache, using the Decorator pattern.</description>
      <link>https://endjin.com/blog/multi-layer-caching-with-the-decorator-pattern</link>
      <guid isPermaLink="true">https://endjin.com/blog/multi-layer-caching-with-the-decorator-pattern</guid>
      <pubDate>Fri, 01 May 2026 05:30:00 GMT</pubDate>
      <category>dotnet</category>
      <category>Data Engineering</category>
      <category>Databricks</category>
      <category>SQL Serverless</category>
      <category>Application Development</category>
      <category>Caching</category>
      <category>Azure</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/03/multi-layer-caching-with-the-decorator-pattern.png" />
      <dc:creator>Jonathan George</dc:creator>
<content:encoded><![CDATA[<p>TL;DR: Querying a Databricks SQL Serverless endpoint for analytical data is fast once the cluster is warm, but cold-start latency and query execution times make it unsuitable as the direct backing store for a web API. We solved this with two layers of caching: Azure Blob Storage for persistence across restarts, and <code>IMemoryCache</code> for sub-millisecond in-process reads, implemented cleanly using the Decorator pattern.</p>
<h2 id="the-performance-challenge">The Performance Challenge</h2>
<p>As part of a recent project, we were building an analytical web API that serves sales data: products, retailers, historical figures and projections, to a React front-end. The data is produced by an ETL process that runs in Databricks and writes the results to Delta tables in a data lake. The natural way to query that data is through Databricks SQL Serverless: it handles the complex analytical workloads, scales well, and integrates cleanly with the rest of the stack.</p>
<p>There's a catch, though. Databricks SQL Serverless clusters can be paused when idle, and cold-start latency can add several seconds (sometimes tens of seconds) to the first request after a period of inactivity. Even on a warm cluster, query execution time for some of the larger datasets runs into multiple seconds. For a web API that needs to feel responsive, that's a problem.</p>
<p>The good news is that most of the data retrieved from Databricks is <em>reference data</em>: it changes infrequently, and when it does change, it changes in a controlled, versioned way. That observation is what makes aggressive caching safe, and it's the insight that shaped the approach described here.</p>
<p><strong>Key observation:</strong> the application works with specific named versions of the sales data. Within a given version, the data is completely immutable. There's no risk of serving stale data from a cache, because the data simply doesn't change once a version is published.</p>
<h2 id="understanding-the-data-access-requirements">Understanding the Data Access Requirements</h2>
<p>All data access in the application flows through a single interface, <code>ISalesDataRepository</code>. Here's a trimmed version of its key methods:</p>
<pre><code class="language-csharp">public interface ISalesDataRepository
{
    Task&lt;string&gt; GetLatestVersionIdAsync();
    Task&lt;VersionInfo&gt; GetVersionAsync(string versionId);
    Task&lt;Product[]&gt; GetProductsAsync(string versionId);
    Task&lt;Retailer[]&gt; GetRetailersAsync(string versionId);
    Task&lt;SalesSummary[]&gt; GetSalesAsync(string versionId, DateRange dateRange);
    Task&lt;SalesSummary[]&gt; GetSalesByIdsAsync(string versionId, string[] ids);
}
</code></pre>
<p>The versioning model is central to everything. A version is created by the ETL process and is immutable once published. The application operates within a specific version; most requests include a <code>versionId</code> that scopes the data being retrieved.</p>
<p>Not all data falls into the same category, though. When we analyse the methods, three distinct types emerge:</p>
<ul>
<li><strong>Fully immutable within a version</strong> — products, retailers, and other reference data. Once fetched for a given version, these can be cached indefinitely. They will never change.</li>
<li><strong>Near-real-time</strong> — the latest version ID. This needs a short time-to-live (we use five minutes) to pick up new versions when they're published without hammering the source on every request.</li>
<li><strong>On-demand lookups</strong> — targeted queries like <code>GetSalesByIdsAsync</code>, where the combination of parameters is effectively unbounded. Caching these isn't meaningful; they always go direct to Databricks.</li>
</ul>
<p>That analysis drives a <em>selective</em> caching strategy rather than a blanket one. Not everything is worth caching, and not everything can be cached with the same TTL.</p>
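<p>To make those categories concrete, here's a purely illustrative sketch of how the analysis could be captured in code. The enum and its members are assumptions for illustration only; in the real implementation the choice is simply made per method inside each decorator.</p>
<pre><code class="language-csharp">// Illustrative only: the three caching categories identified above.
public enum CachePolicy
{
    Indefinite, // immutable within a version: products, retailers, version metadata
    ShortTtl,   // near-real-time: the latest version ID (a few minutes)
    None        // unbounded on-demand lookups such as GetSalesByIdsAsync: always go to source
}
</code></pre>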
<h2 id="the-decorator-pattern-a-primer">The Decorator Pattern: A Primer</h2>
<p>If you haven't used the Decorator pattern before, the idea is straightforward. A decorator implements the same interface as the class it wraps, and adds behaviour before, after, or instead of delegating calls to the inner implementation. The consumer doesn't know, or care, whether it's talking to a "real" implementation or a decorator. It just uses <code>ISalesDataRepository</code>.</p>
<p>This is a natural fit for caching. The actual data access code stays clean and focused on talking to Databricks. The caching logic lives entirely in the decorators. The two concerns don't touch each other.</p>
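<p>If it helps to see the shape, here's a minimal, hypothetical decorator over the same interface. It adds timing around a single method and delegates everything else untouched; the class name and logging are illustrative, not part of the real implementation described below.</p>
<pre><code class="language-csharp">// using System.Diagnostics;
// using Microsoft.Extensions.Logging;

// Hypothetical example of the decorator shape: same interface, inner instance injected.
public sealed class TimingSalesDataRepository : ISalesDataRepository
{
    private readonly ISalesDataRepository _inner;
    private readonly ILogger&lt;TimingSalesDataRepository&gt; _logger;

    public TimingSalesDataRepository(
        ISalesDataRepository inner,
        ILogger&lt;TimingSalesDataRepository&gt; logger)
    {
        _inner = inner;
        _logger = logger;
    }

    public async Task&lt;Product[]&gt; GetProductsAsync(string versionId)
    {
        // Behaviour added around the call; the inner implementation is untouched.
        var stopwatch = Stopwatch.StartNew();
        var products = await _inner.GetProductsAsync(versionId);
        _logger.LogInformation("GetProductsAsync took {Elapsed}", stopwatch.Elapsed);
        return products;
    }

    // The remaining members simply delegate to the inner instance.
    public Task&lt;string&gt; GetLatestVersionIdAsync() =&gt; _inner.GetLatestVersionIdAsync();
    public Task&lt;VersionInfo&gt; GetVersionAsync(string versionId) =&gt; _inner.GetVersionAsync(versionId);
    public Task&lt;Retailer[]&gt; GetRetailersAsync(string versionId) =&gt; _inner.GetRetailersAsync(versionId);
    public Task&lt;SalesSummary[]&gt; GetSalesAsync(string versionId, DateRange dateRange) =&gt; _inner.GetSalesAsync(versionId, dateRange);
    public Task&lt;SalesSummary[]&gt; GetSalesByIdsAsync(string versionId, string[] ids) =&gt; _inner.GetSalesByIdsAsync(versionId, ids);
}
</code></pre>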
<p>In our case, the chain looks like this:</p>
<pre><code>Client
  └─► MemoryCachingSalesDataRepository             (Layer 2: in-process, sub-millisecond)
        └─► BlobStorageCachingSalesDataRepository  (Layer 1: shared, persistent, fast)
              └─► SalesDataRepository              (Real implementation: Databricks SQL)
</code></pre>
<p>Each layer wraps the one below it. The <code>MemoryCachingSalesDataRepository</code> doesn't know it's wrapping a blob cache; it just knows it has an <code>ISalesDataRepository</code> to delegate to when it misses. Dependency injection wires the chain together; the decorators themselves have no knowledge of each other.</p>
<h2 id="layer-1-azure-blob-storage-cache">Layer 1: Azure Blob Storage Cache</h2>
<h3 id="why-blob-storage">Why Blob Storage?</h3>
<p>An in-process memory cache is lost when the API restarts or when a new instance spins up. Our application runs in Azure Container Apps, which scales out to multiple replicas and restarts during deployments. Without a persistent cache layer, every new instance would need to hit Databricks on its first request, which is exactly the cold-start problem we're trying to avoid.</p>
<p>Blob Storage is cheap, fast for reads, and shared across all replicas. It's not as fast as in-process memory, but it's orders of magnitude faster than waiting for a Databricks cluster to warm up.</p>
<h3 id="how-it-works">How It Works</h3>
<p><code>BlobStorageCachingSalesDataRepository</code> implements <code>ISalesDataRepository</code> and wraps the real <code>SalesDataRepository</code>. For each cacheable method, it constructs a deterministic blob path based on the data type, version ID, and any relevant parameters; for example, <code>sales/{versionId}/products.bin</code> for the products list.</p>
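<p>The exact path scheme isn't important, but it must be deterministic so that every instance computes the same path for the same data. The following is a sketch of the kind of helpers involved; the method names and the <code>DateRange</code> properties are assumptions for illustration.</p>
<pre><code class="language-csharp">// Illustrative path builders: the same inputs always produce the same blob path.
private static string ProductsBlobPath(string versionId) =&gt;
    $"sales/{versionId}/products.bin";

private static string RetailersBlobPath(string versionId) =&gt;
    $"sales/{versionId}/retailers.bin";

// Parameterised data incorporates the parameters into the path so different
// queries don't collide. Start/End are assumed properties of DateRange.
private static string SalesBlobPath(string versionId, DateRange dateRange) =&gt;
    $"sales/{versionId}/sales_{dateRange.Start:yyyyMMdd}_{dateRange.End:yyyyMMdd}.bin";
</code></pre>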
<p>The core of the implementation is a generic <code>GetOrCreateAsync&lt;T&gt;()</code> helper:</p>
<pre><code class="language-csharp">private async Task&lt;T&gt; GetOrCreateAsync&lt;T&gt;(
    string blobPath,
    Func&lt;Task&lt;T&gt;&gt; factory)
{
    var blobClient = _containerClient.GetBlobClient(blobPath);

    if (await blobClient.ExistsAsync())
    {
        var content = await blobClient.DownloadContentAsync();
        return MemoryPackSerializer.Deserialize&lt;T&gt;(content.Value.Content.ToArray());
    }

    // Cache miss: fetch from inner repository
    var result = await factory();

    // Write to blob storage for next time
    var bytes = MemoryPackSerializer.Serialize(result);
    await blobClient.UploadAsync(BinaryData.FromBytes(bytes), overwrite: true);

    return result;
}
</code></pre>
<p>The data is stored as binary blobs using <a href="https://github.com/Cysharp/MemoryPack">MemoryPack</a>, a high-performance binary serialiser for .NET. Compared to JSON, this keeps blob sizes small and deserialisation fast; both matter at the scale of multiple API instances reading from shared storage.</p>
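<p>One practical detail worth noting: MemoryPack uses a compile-time source generator, so the types being cached need to be declared <code>partial</code> and annotated. A minimal sketch of what that looks like - the properties here are invented purely for illustration:</p>
<pre><code class="language-csharp">// using MemoryPack;

// MemoryPack generates the serialisation code at compile time for annotated partial types.
[MemoryPackable]
public partial class Product
{
    public string Id { get; set; } = string.Empty;
    public string Name { get; set; } = string.Empty;
}
</code></pre>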
<h3 id="preventing-the-thundering-herd">Preventing the Thundering Herd</h3>
<p>A naïve implementation has a race condition that's particularly nasty during cold starts. If the blob doesn't exist and multiple requests arrive concurrently (exactly what happens when a new container instance starts up and the front-end fires several API calls at once), they all miss the cache and all hit Databricks simultaneously.</p>
<p>This is the <a href="https://en.wikipedia.org/wiki/Thundering_herd_problem">thundering herd problem</a>. In the worst case, you end up making dozens of parallel queries to a cluster that's still warming up.</p>
<p>The solution is a <code>ConcurrentDictionary&lt;string, SemaphoreSlim&gt;</code> keyed by blob path. When a cache miss occurs, we acquire the semaphore for that blob path before proceeding. Inside the lock, we check the blob again; another request may have already populated it while we were waiting. If it's still a miss, we fetch from the inner repository and write the result. Here's the pattern in full:</p>
<pre><code class="language-csharp">private async Task&lt;T&gt; GetOrCreateAsync&lt;T&gt;(
    string blobPath,
    Func&lt;Task&lt;T&gt;&gt; factory)
{
    var blobClient = _containerClient.GetBlobClient(blobPath);

    // Fast path: blob already exists
    if (await blobClient.ExistsAsync())
    {
        var content = await blobClient.DownloadContentAsync();
        return MemoryPackSerializer.Deserialize&lt;T&gt;(content.Value.Content.ToArray());
    }

    // Slow path: acquire per-blob semaphore to prevent thundering herd
    var semaphore = _semaphores.GetOrAdd(blobPath, _ =&gt; new SemaphoreSlim(1, 1));
    await semaphore.WaitAsync();

    try
    {
        // Double-check: another waiter may have populated the blob
        if (await blobClient.ExistsAsync())
        {
            var content = await blobClient.DownloadContentAsync();
            return MemoryPackSerializer.Deserialize&lt;T&gt;(content.Value.Content.ToArray());
        }

        var result = await factory();

        var bytes = MemoryPackSerializer.Serialize(result);
        await blobClient.UploadAsync(BinaryData.FromBytes(bytes), overwrite: true);

        return result;
    }
    finally
    {
        semaphore.Release();
    }
}
</code></pre>
<p>Only one request per unique blob path reaches Databricks. All others wait for the semaphore, benefit from the result, and return immediately. It's worth noting that this applies <em>per instance</em>: if there are multiple container replicas running, each can independently miss the cache and populate the blob. In practice that's fine: the first request to any instance pays the Databricks cost; subsequent requests benefit from the cached blob. If it does become a problem, there are more complex solutions using shared locks - for example, <a href="https://github.com/corvus-dotnet/Corvus.Leasing">Corvus.Leasing</a>, which uses Azure Blob Storage to acquire, release, and extend exclusive leases that mediate resource access in distributed processing.</p>
<h3 id="whats-deliberately-not-cached-here">What's Deliberately Not Cached Here</h3>
<p>Version lookups (<code>GetLatestVersionIdAsync</code> and <code>GetAvailableVersionsAsync</code>) always go to the source. We want the application to notice when a new version is published within a reasonable time, and caching these in blob storage would give us no meaningful benefit over the in-memory TTL we apply at the layer above.</p>
<p>Targeted on-demand lookups by ID also bypass the blob cache. The combination space (different sets of IDs against different versions) is too large to cache meaningfully.</p>
<h3 id="graceful-degradation">Graceful Degradation</h3>
<p>All blob I/O is wrapped in try/catch. If the cache layer fails for any reason (transient connectivity, permissions, a bad serialisation), it logs a warning and falls through to the inner repository. The application keeps working; it's just slower until the cache is warm again.</p>
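<p>In sketch form, that looks roughly like the following. The helper name is hypothetical and this is a simplified illustration rather than the production code:</p>
<pre><code class="language-csharp">// Illustrative: any failure in the cache path is logged and the call falls
// straight through to the inner repository, so the API degrades rather than breaks.
private async Task&lt;T&gt; GetOrCreateWithFallbackAsync&lt;T&gt;(
    string blobPath,
    Func&lt;Task&lt;T&gt;&gt; factory)
{
    try
    {
        return await GetOrCreateAsync(blobPath, factory);
    }
    catch (Exception ex)
    {
        _logger.LogWarning(ex, "Blob cache unavailable for {BlobPath}; falling back to inner repository", blobPath);
        return await factory();
    }
}
</code></pre>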
<h2 id="layer-2-in-memory-cache">Layer 2: In-Memory Cache</h2>
<h3 id="why-a-second-layer">Why a Second Layer?</h3>
<p>Even a fast Blob Storage read involves a network round-trip and deserialisation overhead. For a busy API serving the same reference data many times per second, that adds up. <code>IMemoryCache</code> keeps deserialised objects in the process's heap. Reads are effectively instantaneous: no network, no deserialisation, just a dictionary lookup.</p>
<h3 id="how-it-works-1">How It Works</h3>
<p><code>MemoryCachingSalesDataRepository</code> wraps the Blob Storage decorator (which in turn wraps the real repository). It uses the standard <code>cache.GetOrCreateAsync()</code> pattern, with expiry configured per data type:</p>
<ul>
<li><strong>Immutable data within a version</strong>: no expiry, held in memory until the process restarts.</li>
<li><strong>Latest version ID</strong>: five-minute sliding expiry, so new versions are picked up in a timely manner.</li>
</ul>
<p>Cache keys incorporate the version ID where relevant, so different versions don't collide.</p>
<p>Here's a representative method:</p>
<pre><code class="language-csharp">public async Task&lt;Product[]&gt; GetProductsAsync(string versionId)
{
    using var activity = _activitySource.StartActivity("GetProducts");

    var cacheKey = $"{ProductsCacheKeyPrefix}{versionId}";

    if (_cache.TryGetValue(cacheKey, out Product[]? cached))
    {
        activity?.SetTag("cache.hit", true);
        return cached!;
    }

    activity?.SetTag("cache.hit", false);

    var result = await _innerRepository.GetProductsAsync(versionId);
    _cache.Set(cacheKey, result);

    return result;
}
</code></pre>
<h3 id="observability-with-activity-source">Observability with Activity Source</h3>
<p>Each method creates an <code>Activity</code> via <code>ActivitySource</code>, which participates in distributed tracing through OpenTelemetry. We record whether the request was a cache hit or miss as a tag on the activity: <code>cache.hit = true/false</code>.</p>
<p>This turns out to be genuinely useful. When we look at the observability dashboard, we can see at a glance what proportion of requests are being served from memory versus falling through to lower layers. It's how we validated that the cache was actually working as expected after deployment.</p>
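<p>For completeness, the tracing side needs a shared <code>ActivitySource</code> registered in DI and hooked up to an exporter so the <code>cache.hit</code> tags actually reach your observability backend. A minimal sketch, assuming the standard OpenTelemetry packages; the source name is illustrative:</p>
<pre><code class="language-csharp">// using System.Diagnostics;
// using OpenTelemetry.Trace;

// Single ActivitySource shared by the decorators (the name is illustrative).
services.AddSingleton(new ActivitySource("SalesData"));

// Export the activities (and their cache.hit tags) via OpenTelemetry.
services.AddOpenTelemetry()
    .WithTracing(tracing =&gt; tracing
        .AddSource("SalesData")
        .AddAspNetCoreInstrumentation()
        .AddOtlpExporter());
</code></pre>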
<h3 id="what-doesnt-get-cached-in-memory">What Doesn't Get Cached In Memory</h3>
<p>Targeted on-demand lookups by ID, as with the blob layer, always delegate to the inner repository. The combination space makes in-memory caching impractical; we'd risk holding enormous amounts of data with a very low hit rate.</p>
<h2 id="wiring-it-together-with-dependency-injection">Wiring It Together with Dependency Injection</h2>
<p>The decorator chain is composed in the service registration. The order matters: each decorator needs to wrap the layer below it, not the one above. We register the concrete types first, then register the <code>ISalesDataRepository</code> abstraction as the fully-composed outermost decorator:</p>
<pre><code class="language-csharp">// Innermost: the real Databricks implementation
services.AddSingleton&lt;SalesDataRepository&gt;();

// Middle layer: Blob Storage cache wrapping the real implementation
services.AddSingleton&lt;BlobStorageCachingSalesDataRepository&gt;(sp =&gt;
    new BlobStorageCachingSalesDataRepository(
        sp.GetRequiredService&lt;SalesDataRepository&gt;(),
        sp.GetRequiredService&lt;BlobServiceClient&gt;(),
        sp.GetRequiredService&lt;IOptions&lt;SalesDataBlobCacheOptions&gt;&gt;(),
        sp.GetRequiredService&lt;ILogger&lt;BlobStorageCachingSalesDataRepository&gt;&gt;(),
        sp.GetRequiredService&lt;ActivitySource&gt;()
    )
);

// Outermost: in-memory cache wrapping the blob storage cache
// This is what consumers receive when they depend on ISalesDataRepository
services.AddSingleton&lt;ISalesDataRepository&gt;(sp =&gt;
    new MemoryCachingSalesDataRepository(
        sp.GetRequiredService&lt;BlobStorageCachingSalesDataRepository&gt;(),
        sp.GetRequiredService&lt;IMemoryCache&gt;(),
        sp.GetRequiredService&lt;ActivitySource&gt;()
    )
);
</code></pre>
<p>Any component that depends on <code>ISalesDataRepository</code> via the DI container automatically gets the fully-composed chain. The decorators themselves have no knowledge of how they're composed; they just know they have an <code>ISalesDataRepository</code> to delegate to.</p>
<h2 id="tracing-a-request-through-the-cache-layers">Tracing a Request Through the Cache Layers</h2>
<p>Let's walk through what happens for a call to <code>GetProductsAsync</code> under three different scenarios.</p>
<h3 id="first-request-after-deployment-cold-everything">First Request After Deployment (Cold Everything)</h3>
<ol>
<li><strong><code>MemoryCachingSalesDataRepository</code></strong> — cache miss; delegates to inner.</li>
<li><strong><code>BlobStorageCachingSalesDataRepository</code></strong> — blob not found; acquires semaphore; delegates to inner.</li>
<li><strong><code>SalesDataRepository</code></strong> — queries Databricks SQL. If the cluster has been idle, this may take several seconds while it warms up.</li>
<li>The result flows back up: written to Blob Storage, then stored in the in-process cache.</li>
</ol>
<p>This is the expensive path. It only happens once per unique dataset per application instance.</p>
<h3 id="second-request-in-the-same-process">Second Request in the Same Process</h3>
<ol>
<li><strong><code>MemoryCachingSalesDataRepository</code></strong> — cache hit; returns immediately from memory.</li>
</ol>
<p>That's it. Sub-millisecond response time regardless of what Databricks is doing.</p>
<h3 id="first-request-after-a-replica-starts-or-a-process-restart">First Request After a Replica Starts (or a Process Restart)</h3>
<ol>
<li><strong><code>MemoryCachingSalesDataRepository</code></strong> — cache miss (new process, empty memory cache).</li>
<li><strong><code>BlobStorageCachingSalesDataRepository</code></strong> — blob found; deserialises and returns.</li>
<li>The result is held in memory for all subsequent requests.</li>
</ol>
<p>The new instance pays a Blob Storage round-trip on its first request for each dataset, but never needs to hit Databricks. The warm-up time for a new replica is a handful of Blob Storage reads rather than a cluster cold-start.</p>
<h2 id="results-and-trade-offs">Results and Trade-offs</h2>
<h3 id="what-we-gained">What We Gained</h3>
<p>After the initial warm-up, the vast majority of read requests are served from the in-memory cache in sub-millisecond time. New replicas warm quickly from Blob Storage without touching Databricks. The thundering herd problem is eliminated: Databricks is hit at most once per unique dataset per version per application instance, regardless of how many concurrent requests arrive.</p>
<h3 id="honest-trade-offs">Honest Trade-offs</h3>
<p>It would be misleading to present this as a straightforward win with no downsides. There are real trade-offs:</p>
<ul>
<li><strong>Complexity.</strong> We now have three classes where one might seem simpler at first glance. Each decorator is individually straightforward, but you need to understand the whole chain to follow a request through the code.</li>
<li><strong>Staleness by design.</strong> The latest version ID has a five-minute lag. The application has to tolerate that, and the product team has to accept it. In our case that's fine - new versions aren't published on a minute-by-minute basis - but it's a deliberate constraint.</li>
<li><strong>Cache invalidation is implicit.</strong> When a new version is published, the old version's blobs remain in Blob Storage; they're simply never requested again. A separate cleanup process could remove them if storage cost becomes a concern, but for now the cost is negligible.</li>
<li><strong>Memory pressure.</strong> Keeping large datasets in <code>IMemoryCache</code> indefinitely is a deliberate choice that works because our process's memory budget accommodates it. For larger datasets or more memory-constrained environments, you'd want to think carefully about size limits and eviction policies.</li>
<li><strong>The Blob Storage layer adds latency on cold reads.</strong> If the Databricks cluster happens to be warm when a blob is missing, going via Blob Storage is actually slower than going direct. In practice, a warm cluster coinciding with a cold cache is the exception rather than the rule, but it's worth being aware of.</li>
</ul>
<p>As always, the answer is "it depends." This approach made sense for our workload profile. For a different set of constraints (larger datasets, more frequent version changes, tighter memory budgets), some of these trade-offs might tip the other way.</p>
<h2 id="conclusions">Conclusions</h2>
<p>The Decorator pattern is a clean fit for layered caching because it keeps caching logic entirely separate from data access logic. Adding a new cache tier is additive; it doesn't require changes to existing classes. The chain is composed by the DI configuration, not by the decorators themselves.</p>
<p>The design decisions that made this work were:</p>
<ol>
<li><strong>Understanding which data is truly immutable.</strong> The versioning model gave us a strong guarantee that made aggressive, indefinite caching safe.</li>
<li><strong>Choosing the right storage tier for each layer.</strong> Blob Storage for persistence and cross-replica sharing; <code>IMemoryCache</code> for the fast path.</li>
<li><strong>Protecting against the thundering herd.</strong> The semaphore-based double-check at the Blob Storage layer is easy to overlook but critical at cold start.</li>
</ol>
<p>Databricks SQL Serverless is a powerful analytical query engine. The trick is to use it for what it's good at (processing and transforming large analytical datasets) and let fast caches absorb the high-frequency, low-latency reads that a web API demands. The Decorator pattern gives us the architectural seam to do that cleanly.</p>
<p>The same pattern applies well beyond Databricks. Anywhere you have a slow or expensive data source serving data that changes infrequently, layering caches using decorators is a maintainable and extensible approach worth considering.</p>
<p>If you've got any questions or would like to discuss anything we've talked about, please feel free to leave a comment below.</p>]]></content:encoded>
    </item>
    <item>
      <title>Fabric Performance Benchmarking - Spark versus Python Notebooks</title>
      <description>Benchmarking Pandas, PySpark, Polars, and DuckDB on Microsoft Fabric: in-process Python engines run 4-5x cheaper and faster than Spark for common workloads.</description>
      <link>https://endjin.com/blog/fabric-performance-benchmarking-spark-versus-python-notebooks</link>
      <guid isPermaLink="true">https://endjin.com/blog/fabric-performance-benchmarking-spark-versus-python-notebooks</guid>
      <pubDate>Wed, 22 Apr 2026 05:00:00 GMT</pubDate>
      <category>Microsoft Fabric</category>
      <category>Notebooks</category>
      <category>Spark</category>
      <category>Pyspark</category>
      <category>Python</category>
      <category>DuckDB</category>
      <category>SQL</category>
      <category>Polars</category>
      <category>DataFrame</category>
      <category>Data</category>
      <category>Analytics</category>
      <category>Performance</category>
      <category>Data Processing</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/03/fabric-performance-benchmarking-part-1-spark-versus-python-notebooks.png" />
      <dc:creator>Barry Smart</dc:creator>
      <content:encoded><![CDATA[<p><strong>TL;DR</strong> — This post analyses the results of benchmarking different data processing engines on <a href="https://www.microsoft.com/en-us/microsoft-fabric">Microsoft Fabric</a>. We compare <a href="https://pandas.pydata.org/">Pandas</a>, <a href="https://spark.apache.org/docs/latest/api/python/index.html">PySpark</a>, <a href="https://pola.rs/">Polars</a>, and <a href="https://duckdb.org/">DuckDB</a> across various compute configurations. The results provide concrete, Fabric-specific evidence for a broader industry trend: for medium-scale datasets (anything up to ~100GB), modern in-process engines like DuckDB and Polars on single-node Python notebooks are consistently faster and up to 5x cheaper than distributed Spark clusters.  The code used to generate the benchmark is <a href="https://github.com/endjin/fabric-performance-benchmark">available in a public repo on GitHub</a>.</p>
<p><a class="github-repo-card" href="https://github.com/endjin/fabric-performance-benchmark" target="_blank" rel="noopener noreferrer" data-github-repo="endjin/fabric-performance-benchmark"><span class="github-repo-card__row"><svg class="github-repo-card__logo" aria-hidden="true" viewBox="0 0 16 16" width="24" height="24" fill="currentColor"><path d="M8 0C3.58 0 0 3.58 0 8c0 3.54 2.29 6.53 5.47 7.59.4.07.55-.17.55-.38 0-.19-.01-.82-.01-1.49-2.01.37-2.53-.49-2.69-.94-.09-.23-.48-.94-.82-1.13-.28-.15-.68-.52-.01-.53.63-.01 1.08.58 1.23.82.72 1.21 1.87.87 2.33.66.07-.52.28-.87.51-1.07-1.78-.2-3.64-.89-3.64-3.95 0-.87.31-1.59.82-2.15-.08-.2-.36-1.02.08-2.12 0 0 .67-.21 2.2.82.64-.18 1.32-.27 2-.27.68 0 1.36.09 2 .27 1.53-1.04 2.2-.82 2.2-.82.44 1.1.16 1.92.08 2.12.51.56.82 1.27.82 2.15 0 3.07-1.87 3.75-3.65 3.95.29.25.54.73.54 1.48 0 1.07-.01 1.93-.01 2.2 0 .21.15.46.55.38A8.013 8.013 0 0 0 16 8c0-4.42-3.58-8-8-8z"></path></svg><span class="github-repo-card__content"><span class="github-repo-card__title">endjin/fabric-performance-benchmark</span><span class="github-repo-card__description" data-field="description" hidden=""></span><span class="github-repo-card__meta"><span class="github-repo-card__language" data-field="language" hidden=""><span class="github-repo-card__lang-dot"></span><span data-field="language-name"></span></span><span class="github-repo-card__stars" data-field="stars" hidden=""><svg aria-hidden="true" viewBox="0 0 16 16" width="16" height="16" fill="currentColor"><path d="M8 .25a.75.75 0 0 1 .673.418l1.882 3.815 4.21.612a.75.75 0 0 1 .416 1.279l-3.046 2.97.719 4.192a.75.75 0 0 1-1.088.791L8 12.347l-3.766 1.98a.75.75 0 0 1-1.088-.79l.72-4.194L.818 6.374a.75.75 0 0 1 .416-1.28l4.21-.611L7.327.668A.75.75 0 0 1 8 .25z"></path></svg><span data-field="stars-count"></span></span><span class="github-repo-card__forks" data-field="forks" hidden=""><svg aria-hidden="true" viewBox="0 0 16 16" width="16" height="16" fill="currentColor"><path d="M5 5.372v.878c0 .414.336.75.75.75h4.5a.75.75 0 0 0 .75-.75v-.878a2.25 2.25 0 1 1 1.5 0v.878a2.25 2.25 0 0 1-2.25 2.25h-1.5v2.128a2.251 2.251 0 1 1-1.5 0V8.5h-1.5A2.25 2.25 0 0 1 3.5 6.25v-.878a2.25 2.25 0 1 1 1.5 0zM5 3.25a.75.75 0 1 0-1.5 0 .75.75 0 0 0 1.5 0zm6.75.75a.75.75 0 1 0 0-1.5.75.75 0 0 0 0 1.5zm-3 8.75a.75.75 0 1 0-1.5 0 .75.75 0 0 0 1.5 0z"></path></svg><span data-field="forks-count"></span></span></span></span><img class="github-repo-card__avatar" src="https://github.com/endjin.png?size=120" alt="endjin avatar" loading="lazy" referrerpolicy="no-referrer"></span></a></p>
<h2 id="the-broader-context-a-platform-responding-to-a-shift">The Broader Context: A Platform Responding to a Shift</h2>
<p>This benchmarking study doesn't exist in isolation. It is the latest instalment in our <a href="https://endjin.com/what-we-think/topics/polars-series/">Adventures in Polars</a> and <a href="https://endjin.com/what-we-think/topics/duckdb-series/">DuckDB</a> series, in which we have been making the case that a fundamental shift is underway in how organisations approach analytical data processing.</p>
<p>The argument, which we have explored in depth across both series, centres on what Hannes Mühleisen (co-creator of DuckDB and Professor of Data Engineering at the University of Nijmegen) calls the <strong>"data singularity"</strong>: the point at which the processing power of mainstream single-node machines surpasses the requirements of the vast majority of analytical workloads. We introduced this concept in <a href="https://endjin.com/blog/duckdb-rise-of-in-process-analytics-understanding-data-singularity">DuckDB: The Rise of In-Process Analytics and Data Singularity</a> and revisited it from a Polars perspective in <a href="https://endjin.com/blog/polars-faster-pipelines-simpler-infrastructure-happier-engineers">Why Polars Matters for Decision Makers</a>.</p>
<p>The core observation is straightforward: CPU core counts, RAM, and NVMe storage throughput have all improved dramatically over the past decade, while the size of most <em>useful</em> analytical datasets has grown far more slowly. Amazon's own internal Redshift telemetry, <a href="https://motherduck.com/blog/redshift-files-hunt-for-big-data/">analysed by MotherDuck</a>, suggests we are already close to the singularity — the 99th percentile of datasets in a production big data platform fits comfortably on a modern laptop. Their data suggests we are spending around 94% of query dollars on computation that doesn't actually need big data infrastructure.</p>
<p>The problem, historically, was that most data tools were designed <em>before</em> this shift. They could not take advantage of modern hardware capabilities. That gap gave rise to a new generation of <strong>in-process analytics engines</strong> which are built to exploit the full potential of a single, well-resourced machine. Two tools stand out: <a href="https://duckdb.org/">DuckDB</a> for those who prefer SQL, and <a href="https://pola.rs/">Polars</a> for those who prefer a DataFrame API. Both re-engineer the analytics stack from the ground up: column-oriented storage, vectorized execution across all available CPU cores, and intelligent query planning that eliminates unnecessary work. Neither requires a cluster. Neither has network overhead, authentication complexity, or the coordination costs of distributed systems.</p>
<p>The practical implication, and the hypothesis this benchmarking study was designed to test, is that for the majority of enterprise analytical workloads, these tools running on a single well-resourced node will outperform Spark at a fraction of the cost.</p>
<h2 id="microsoft-fabrics-response-the-python-notebook">Microsoft Fabric's Response: The Python Notebook</h2>
<p>It is telling that Microsoft has recognised this shift at a platform level. The introduction of <a href="https://learn.microsoft.com/en-us/fabric/data-engineering/using-python-experience-on-notebook"><strong>Python Notebooks</strong></a> to Microsoft Fabric is a direct response to the in-process analytics movement.</p>
<p>Where Fabric's Spark Notebooks provision a distributed cluster on demand (with all the associated overhead in spin-up time, coordination cost, and capacity consumption) Python Notebooks provide a single, configurable execution node. They come <strong>pre-installed with both DuckDB and Polars</strong>, a notable design choice that signals Microsoft's acknowledgement that these tools have earned their place in the enterprise data stack. Microsoft explicitly recommends both as alternatives to Pandas for memory-intensive workloads.</p>
<p>In parallel, both DuckDB and Polars have added direct support for <strong>OneLake</strong> — the storage platform that underpins Fabric. DuckDB's native Delta extension enables querying Delta tables stored in a Fabric Lakehouse via <code>delta_scan()</code>, reading directly from OneLake paths without additional configuration. Polars similarly supports reading and writing Delta tables via <code>pl.read_delta()</code>, <code>pl.scan_delta()</code>, and <code>df.write_delta()</code>, with OneLake authentication handled through Fabric's <code>notebookutils</code>. Both tools also support standard ABFS paths, enabling direct interaction with raw files in the Lakehouse Files area. We have covered these integration patterns in detail in <a href="https://endjin.com/blog/duckdb-in-practice-enterprise-integration-architectural-patterns">DuckDB Workloads on Microsoft Fabric</a> and <a href="https://endjin.com/blog/polars-workloads-on-microsoft-fabric">Polars Workloads on Microsoft Fabric</a>.</p>
<p>The Python Notebook is configurable across a range of single-node sizes (2, 4, 8, 16, 32, and 64 vCores, with memory scaling proportionally). This is a meaningful range: a 32-vCore Python Notebook gives DuckDB or Polars access to substantially more parallelism than even a well-resourced Spark executor, without any of the coordination overhead of distributed execution.</p>
<p>What follows is our attempt to put hard numbers behind these claims, using a realistic enterprise workload on real Fabric infrastructure.</p>
<h2 id="data-source">Data Source</h2>
<p>The use case is implemented using open data provided by the <a href="https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads">UK Land Registry House Price Data open data repository</a>, made available under an <a href="https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/">Open Government Licence</a>.</p>
<p>The data is provided as a set of CSV files (one per calendar year) which have been downloaded to a Fabric Lakehouse. The dataset covers property sales since 1995, with approximately 1 million transactions per year. In total, 30 years of historic data amounts to roughly 30 million rows and ~5GB of raw CSV.</p>
<p>This scale is deliberate. We find that many published benchmarks focus on processing datasets that are rarely encountered in practice. Our objective is to focus on the scale we more commonly encounter in enterprise client engagements — medium-scale datasets that are interesting enough to stress-test the differences between engines, but representative of the workloads that most data teams actually run day to day. The data fits in memory for all configurations tested, which is precisely where in-process engines are designed to excel.</p>
<h2 id="use-case">Use Case</h2>
<p>The use case mimics a common set of data transformations typical for data of this nature:</p>
<ol>
<li><strong>Start Up &amp; Set Up</strong> — provisioning the platform (Spark or Python), then completing startup tasks such as importing Python packages.</li>
<li><strong>Ingestion &amp; Transform</strong> — reading raw data from a set of CSV files, standardising, cleaning, and adding derived features.</li>
<li><strong>Write Dimensional Model</strong> — writing different slices of the transformed data to the Lakehouse in Delta format for downstream consumption, in this case as a dimensional model for Power BI.</li>
<li><strong>Read and Summarise</strong> — reading the Delta tables back and running analysis based on filtering, joining, and summarising data across different dimensions.</li>
<li><strong>Benchmark Capture &amp; Clean Up</strong> — capturing timestamps and memory consumption metrics at each stage and persisting them to the Lakehouse for analysis.</li>
</ol>
<pre class="mermaid">flowchart TB
A["1. Start Up &amp; Set Up"] --&gt; B["2. Ingestion &amp; Transform"]
B --&gt; C["3. Write Dimensional Model"]
C --&gt; D["4. Read &amp; Summarise"]
D --&gt; E["5. Benchmark Capture &amp; Clean Up"]
</pre>
<p>This pipeline exercises the full range of operations that matter to data engineers: reading raw files at scale, executing complex transformations, writing Delta tables, and querying structured data back. It is a fair test of what each engine is actually optimised to do.</p>
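<p>To make the shape of those stages concrete, here is a minimal sketch of stages 2-4 expressed in Polars. The file layout uses the default Lakehouse mount available in Fabric notebooks, and the column names and derived feature are illustrative placeholders rather than the real Land Registry schema; the actual benchmark code is in the repository linked at the end of this post.</p>
<pre><code class="language-python">import polars as pl

files = "/lakehouse/default/Files/price_paid/*.csv"   # one CSV per calendar year
tables = "/lakehouse/default/Tables"

# 2. Ingestion &amp; Transform: lazily read every yearly CSV, clean, and derive a feature.
sales = (
    pl.scan_csv(files)
      .filter(pl.col("price") &gt; 0)
      .with_columns(pl.col("date").str.slice(0, 4).cast(pl.Int32).alias("sale_year"))
)

# 3. Write Dimensional Model: a fact table and a small dimension, in Delta format.
sales.collect().write_delta(f"{tables}/fact_sales", mode="overwrite")
sales.select("county").unique().collect().write_delta(f"{tables}/dim_county", mode="overwrite")

# 4. Read &amp; Summarise: read the Delta tables back and aggregate across dimensions.
summary = (
    pl.scan_delta(f"{tables}/fact_sales")
      .group_by("sale_year", "county")
      .agg(pl.col("price").median().alias("median_price"), pl.len().alias("transactions"))
      .collect()
)
print(summary.head())
</code></pre>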
<h2 id="fabric-platforms">Fabric Platforms</h2>
<p>Fabric offers multiple notebook environments. For this study we used two:</p>
<ul>
<li><p><strong>Spark notebooks</strong> — a notebook experience over an on-demand Spark cluster, configurable in terms of vCores, memory, and number of executor nodes. Enables polyglot development (Python, R, SQL) across a distributed compute environment.</p>
</li>
<li><p><strong>Python notebooks</strong> — a more recent addition to Fabric. Python notebooks provision a single execution node sized according to pre-defined configurations (vCores and memory). Whilst positioned for "smaller" workloads, we find that the majority of enterprise use cases can be comfortably accommodated on this platform when the right tooling is chosen.</p>
</li>
</ul>
<h2 id="workloads">Workloads</h2>
<p>The study compares four data processing engines running the same use case on Fabric:</p>
<ol>
<li><p><strong>Pandas</strong> — the default Python library for data engineering. Vast ecosystem, but single-threaded and constrained to datasets that fit in memory. Serves as a baseline.</p>
</li>
<li><p><strong>PySpark</strong> — the Python API for Apache Spark. Designed for distributed computation across clusters and widely deployed via Databricks and Azure Synapse. The incumbent choice for enterprise-scale data engineering.</p>
</li>
<li><p><strong>Polars</strong> — a Rust-based engine with a Python API. Designed from the ground up to exploit modern hardware: automatic parallelisation across all available cores, lazy evaluation with query plan optimisation, and memory-efficient columnar processing. We explored Polars' technical foundations in detail in <a href="https://endjin.com/blog/under-the-hood-what-makes-polars-so-scalable-and-fast">Polars Technical Deep Dive</a>.</p>
</li>
<li><p><strong>DuckDB</strong> — a C++ in-process analytical database with a Python API. Optimised for OLAP workloads, capable of querying CSV files and Delta tables directly, and able to use disk for larger-than-memory datasets. We covered DuckDB's internals — columnar storage, vectorized execution, zone maps — in <a href="https://endjin.com/blog/duckdb-in-depth-how-it-works-what-makes-it-fast">DuckDB In Depth: How It Works and What Makes It Fast</a>.</p>
</li>
</ol>
<h2 id="fabric-capacity-and-capacity-units-cus">Fabric Capacity and Capacity Units (CUs)</h2>
<p>A Fabric capacity is a dedicated pool of compute resources purchased from Azure — a fixed amount of computational horsepower continuously available to workspaces assigned to it. When you purchase a Fabric capacity (e.g. F8, F64), you are reserving that number of capacity units (CUs) for continuous use across all workloads: notebooks, pipelines, warehouses, Power BI, and so on.</p>
<p>CUs are Fabric's abstraction layer for billing compute across heterogeneous workloads. Different engines have different conversion rates, meaning your F64 capacity represents different amounts of practical compute depending on which engine is consuming it.</p>
<p>Consumption is measured in <strong>CU Seconds</strong>: the number of CUs consumed multiplied by duration in seconds. The fundamental formula for both Spark and Python Notebooks in Fabric is 0.5 CUs per vCore per second; for Spark, the total vCore count must include the driver and all executor nodes:</p>
<ul>
<li>A Python Notebook sized at 8 vCores consumes <strong>4 CUs per second</strong>.</li>
<li>A Spark Notebook with 1 driver (8 vCores) and 1 executor (8 vCores) consumes <strong>8 CUs per second</strong>.</li>
</ul>
<p>This billing asymmetry is significant and, as the results below will show, it creates a strong economic case for Python Notebooks when the workload doesn't genuinely require distributed execution.</p>
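<p>The arithmetic is simple enough to express directly. The helper below is purely illustrative, but it shows how the run costs reported later in this post are derived from vCores and execution time:</p>
<pre><code class="language-python">def cu_per_second(total_vcores: int) -&gt; float:
    """Fabric notebooks consume 0.5 CUs per vCore per second."""
    return 0.5 * total_vcores

# An 8-vCore Python Notebook:
print(cu_per_second(8))           # 4.0 CUs per second

# A Spark Notebook with an 8-vCore driver plus one 8-vCore executor (16 vCores in total):
print(cu_per_second(8 + 8))       # 8.0 CUs per second

# Total cost of a run = CUs per second x execution time in seconds,
# e.g. the fastest DuckDB run reported below: 4 x 47.037s = ~188 CU seconds.
print(cu_per_second(8) * 47.037)  # 188.148
</code></pre>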
<h2 id="configurations">Configurations</h2>
<p>Achieving direct parity across Spark and Python notebook platforms is not straightforward. We opted for configurations that allow a range of CU-per-second comparisons, including some like-for-like points:</p>
<table>
<thead>
<tr>
<th>CUs Per Second</th>
<th>Python Notebook Configuration</th>
<th>Spark Pool Configuration</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td><strong>2 vCores, 16G RAM</strong> [Default for Python]</td>
<td></td>
</tr>
<tr>
<td>2</td>
<td>4 vCores, 32G RAM</td>
<td></td>
</tr>
<tr>
<td>4</td>
<td>8 vCores, 64G RAM</td>
<td>1 Executor 4/4 vCores 28G/28G RAM</td>
</tr>
<tr>
<td>6</td>
<td></td>
<td>2 Executors 4/4 vCores 28G/28G RAM</td>
</tr>
<tr>
<td>8</td>
<td>16 vCores, 128G RAM</td>
<td><strong>1 Executor 8/8 vCores 56G/56G RAM</strong> [Default for Spark]</td>
</tr>
<tr>
<td>10</td>
<td></td>
<td>4 Executors 4/4 vCores 28G/28G RAM</td>
</tr>
<tr>
<td>12</td>
<td></td>
<td>2 Executors 8/8 vCores 56G/56G RAM</td>
</tr>
<tr>
<td>16</td>
<td>32 vCores, 256G RAM</td>
<td></td>
</tr>
<tr>
<td>20</td>
<td></td>
<td>4 Executors 8/8 vCores 56G/56G RAM</td>
</tr>
</tbody>
</table>
<p>The default Python Notebook (2 vCores, 16GB RAM) is <strong>8x cheaper per second</strong> to run than the default Spark Notebook (1 Executor 8/8 vCores 56G/56G RAM). The smallest Spark configuration (1 Executor 4/4 vCores 28G/28G) is equivalent in CU cost to the 8-vCore Python Notebook, which provides ~2.5x the RAM and, as the results show, very significant computational throughput when running DuckDB or Polars.</p>
<p>Spin-up times are also materially different, and matter for development workflows. The two default configurations provision quickest: the default Spark Notebook in under 20 seconds, and the default 2-vCore Python Notebook in under 30 seconds. Beyond the defaults, Python Notebooks took a little over two minutes to provision, while Spark configurations took just over three minutes:</p>
<table>
<thead>
<tr>
<th style="text-align: left;">Platform</th>
<th style="text-align: left;">Configuration</th>
<th style="text-align: right;">Median Provisioning Time</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left;"><strong>Fabric PySpark Notebook</strong> [Default for Spark]</td>
<td style="text-align: left;">01 executors 08/08 cores 56g/56g memory</td>
<td style="text-align: right;">18.6</td>
</tr>
<tr>
<td style="text-align: left;"><strong>Fabric Python Notebook</strong> [Default for Python]</td>
<td style="text-align: left;">02 vCores</td>
<td style="text-align: right;">24.356</td>
</tr>
<tr>
<td style="text-align: left;">Fabric Python Notebook</td>
<td style="text-align: left;">16 vCores</td>
<td style="text-align: right;">125.15</td>
</tr>
<tr>
<td style="text-align: left;">Fabric Python Notebook</td>
<td style="text-align: left;">08 vCores</td>
<td style="text-align: right;">129.215</td>
</tr>
<tr>
<td style="text-align: left;">Fabric Python Notebook</td>
<td style="text-align: left;">32 vCores</td>
<td style="text-align: right;">132.767</td>
</tr>
<tr>
<td style="text-align: left;">Fabric Python Notebook</td>
<td style="text-align: left;">04 vCores</td>
<td style="text-align: right;">136.243</td>
</tr>
<tr>
<td style="text-align: left;">Fabric PySpark Notebook</td>
<td style="text-align: left;">02 executors 08/08 cores 56g/56g memory</td>
<td style="text-align: right;">188.749</td>
</tr>
<tr>
<td style="text-align: left;">Fabric PySpark Notebook</td>
<td style="text-align: left;">04 executors 08/08 cores 56g/56g memory</td>
<td style="text-align: right;">192.767</td>
</tr>
<tr>
<td style="text-align: left;">Fabric PySpark Notebook</td>
<td style="text-align: left;">01 executors 04/04 cores 28g/28g memory</td>
<td style="text-align: right;">194.956</td>
</tr>
</tbody>
</table>
<p>When developing directly in Fabric notebooks, we use the default platform configurations operating on test data, then scale up the configuration as needed in production.</p>
<p>With DuckDB and Polars, we favour local development because it gives us access to modern IDEs, coding agents, and unit testing frameworks. Faster provisioning directly improves the developer inner loop, and shorter iteration cycles during development compound over the course of a project. This is consistent with our experience migrating client workloads from Spark to in-process engines, where test suite runtimes dropped from minutes to seconds and local development became a practical reality again.</p>
<p><a target="_blank" href="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/03/fabric-notebook-spin-up-times.png"><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/03/fabric-notebook-spin-up-times.png" alt="Boxplot visualizing the distribution of spin up times for different configurations, showing Python notebooks are significantly faster to provision than Spark notebooks." title="Boxplot visualizing the distribution of spin up times for different configurations, showing Python notebooks are significantly faster to provision than Spark notebooks." srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2026/03/fabric-notebook-spin-up-times.png 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2026/03/fabric-notebook-spin-up-times.png 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2026/03/fabric-notebook-spin-up-times.png 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2026/03/fabric-notebook-spin-up-times.png 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></a></p>
<h2 id="methodology">Methodology</h2>
<p>Multiple runs were completed for each combination of engine, workload, and configuration. Median execution times are reported throughout to reduce the influence of outliers.</p>
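<p>For completeness, the aggregation itself is trivial. Assuming the records captured in stage 5 land in a Delta table with one row per run (the table path and column names here are placeholders), the medians can be computed along these lines:</p>
<pre><code class="language-python">import polars as pl

# Hypothetical results table written by the benchmark's capture stage.
runs = pl.scan_delta("/lakehouse/default/Tables/benchmark_runs")

medians = (
    runs.group_by("platform", "configuration", "workload")
        .agg(
            pl.col("elapsed_seconds").median().alias("median_elapsed"),
            pl.col("execution_seconds").median().alias("median_execution"),
        )
        .sort("median_execution")
        .collect()
)
</code></pre>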
<h2 id="analysis">Analysis</h2>
<h3 id="elapsed-time-analysis">Elapsed Time Analysis</h3>
<p>Elapsed time analysis <strong>includes</strong> the time to spin up the required Spark or Python environment. Environments with faster provisioning therefore have an inherent advantage here. Note that spin-up time does not incur a CU cost on Fabric — if cost is your primary concern, skip ahead to the Execution Time Analysis.</p>
<p>The table below shows median elapsed times for each workload on its default environment:</p>
<table>
<thead>
<tr>
<th style="text-align: left;">Platform</th>
<th style="text-align: left;">Configuration</th>
<th style="text-align: left;">Workload</th>
<th style="text-align: right;">CUs Per Second</th>
<th style="text-align: right;">Median Elapsed Time</th>
<th style="text-align: right;">Percentage of Min Elapsed Time</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left;">Fabric PySpark Notebook</td>
<td style="text-align: left;">01 executors 08/08 cores 56g/56g memory</td>
<td style="text-align: left;">pyspark_benchmark</td>
<td style="text-align: right;">8</td>
<td style="text-align: right;">125.691</td>
<td style="text-align: right;">100</td>
</tr>
<tr>
<td style="text-align: left;">Fabric Python Notebook</td>
<td style="text-align: left;">02 vCores</td>
<td style="text-align: left;">duckdb_benchmark</td>
<td style="text-align: right;">1</td>
<td style="text-align: right;">133.59</td>
<td style="text-align: right;">106.3</td>
</tr>
<tr>
<td style="text-align: left;">Fabric Python Notebook</td>
<td style="text-align: left;">02 vCores</td>
<td style="text-align: left;">polars_benchmark</td>
<td style="text-align: right;">1</td>
<td style="text-align: right;">174.876</td>
<td style="text-align: right;">139.1</td>
</tr>
<tr>
<td style="text-align: left;">Fabric Python Notebook</td>
<td style="text-align: left;">02 vCores</td>
<td style="text-align: left;">pandas_benchmark</td>
<td style="text-align: right;">1</td>
<td style="text-align: right;">275.515</td>
<td style="text-align: right;">219.2</td>
</tr>
</tbody>
</table>
<p>Key observations:</p>
<ul>
<li>PySpark (on its default 8-CU environment) is the fastest at elapsed time (~126 seconds), but only marginally so.</li>
<li>DuckDB on the default Python Notebook (1 CU per second) runs in ~134 seconds — just 6% slower — at <strong>one-eighth the CU cost</strong>.</li>
<li>Polars on the same minimal configuration completes in ~175 seconds, also at one-eighth the cost.</li>
<li>Pandas, the incumbent default, takes over four minutes — roughly twice Spark's elapsed time.</li>
</ul>
<h3 id="execution-time-analysis">Execution Time Analysis</h3>
<p>Execution time <strong>excludes</strong> spin-up and environment provisioning, enabling a direct comparison between engines across configurations at equivalent CU costs.</p>
<p>The <strong>Top 10</strong> results are presented below:</p>
<ul>
<li>DuckDB achieves the fastest pure execution time, on an 8-vCore Python Notebook with 64GB RAM.</li>
<li>DuckDB and Polars occupy the first eight positions (and nine of the top ten); PySpark first appears at rank 9.</li>
<li>The fastest Spark execution time is more than twice that of DuckDB — which achieves its best result on infrastructure with half the CU cost.</li>
</ul>
<table>
<thead>
<tr>
<th style="text-align: right;">Rank</th>
<th style="text-align: left;">Platform</th>
<th style="text-align: left;">Configuration</th>
<th style="text-align: left;">Workload</th>
<th style="text-align: right;">CUs Per Second</th>
<th style="text-align: right;">Median Execution Time</th>
<th style="text-align: right;">Percentage of Min Execution Time</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right;">1</td>
<td style="text-align: left;">Fabric Python Notebook</td>
<td style="text-align: left;">08 vCores</td>
<td style="text-align: left;">duckdb_benchmark</td>
<td style="text-align: right;">4</td>
<td style="text-align: right;">47.037</td>
<td style="text-align: right;">100</td>
</tr>
<tr>
<td style="text-align: right;">2</td>
<td style="text-align: left;">Fabric Python Notebook</td>
<td style="text-align: left;">04 vCores</td>
<td style="text-align: left;">duckdb_benchmark</td>
<td style="text-align: right;">2</td>
<td style="text-align: right;">66.405</td>
<td style="text-align: right;">141.2</td>
</tr>
<tr>
<td style="text-align: right;">3</td>
<td style="text-align: left;">Fabric Python Notebook</td>
<td style="text-align: left;">16 vCores</td>
<td style="text-align: left;">duckdb_benchmark</td>
<td style="text-align: right;">8</td>
<td style="text-align: right;">72.794</td>
<td style="text-align: right;">154.8</td>
</tr>
<tr>
<td style="text-align: right;">4</td>
<td style="text-align: left;">Fabric Python Notebook</td>
<td style="text-align: left;">32 vCores</td>
<td style="text-align: left;">duckdb_benchmark</td>
<td style="text-align: right;">16</td>
<td style="text-align: right;">80.182</td>
<td style="text-align: right;">170.5</td>
</tr>
<tr>
<td style="text-align: right;">5</td>
<td style="text-align: left;">Fabric Python Notebook</td>
<td style="text-align: left;">16 vCores</td>
<td style="text-align: left;">polars_benchmark</td>
<td style="text-align: right;">8</td>
<td style="text-align: right;">86.28</td>
<td style="text-align: right;">183.4</td>
</tr>
<tr>
<td style="text-align: right;">6</td>
<td style="text-align: left;">Fabric Python Notebook</td>
<td style="text-align: left;">08 vCores</td>
<td style="text-align: left;">polars_benchmark</td>
<td style="text-align: right;">4</td>
<td style="text-align: right;">88.203</td>
<td style="text-align: right;">187.5</td>
</tr>
<tr>
<td style="text-align: right;">7</td>
<td style="text-align: left;">Fabric Python Notebook</td>
<td style="text-align: left;">32 vCores</td>
<td style="text-align: left;">polars_benchmark</td>
<td style="text-align: right;">16</td>
<td style="text-align: right;">90.787</td>
<td style="text-align: right;">193</td>
</tr>
<tr>
<td style="text-align: right;">8</td>
<td style="text-align: left;">Fabric Python Notebook</td>
<td style="text-align: left;">04 vCores</td>
<td style="text-align: left;">polars_benchmark</td>
<td style="text-align: right;">2</td>
<td style="text-align: right;">102.398</td>
<td style="text-align: right;">217.7</td>
</tr>
<tr>
<td style="text-align: right;">9</td>
<td style="text-align: left;">Fabric PySpark Notebook</td>
<td style="text-align: left;">01 executors 08/08 cores 56g/56g memory</td>
<td style="text-align: left;">pyspark_benchmark</td>
<td style="text-align: right;">8</td>
<td style="text-align: right;">107.091</td>
<td style="text-align: right;">227.7</td>
</tr>
<tr>
<td style="text-align: right;">10</td>
<td style="text-align: left;">Fabric Python Notebook</td>
<td style="text-align: left;">02 vCores</td>
<td style="text-align: left;">duckdb_benchmark</td>
<td style="text-align: right;">1</td>
<td style="text-align: right;">107.871</td>
<td style="text-align: right;">229.3</td>
</tr>
</tbody>
</table>
<p>The line chart below shows median execution times across all engines and configurations, with CUs per second as the common measure of both environment size and cost:</p>
<p><a target="_blank" href="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/03/fabric-notebook-execution-time.png"><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/03/fabric-notebook-execution-time.png" alt="Line chart showing median execution times for each engine, demonstrating that DuckDB and Polars outperform PySpark at lower CU levels." title="Line chart showing median execution times for each engine, demonstrating that DuckDB and Polars outperform PySpark at lower CU levels." srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2026/03/fabric-notebook-execution-time.png 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2026/03/fabric-notebook-execution-time.png 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2026/03/fabric-notebook-execution-time.png 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2026/03/fabric-notebook-execution-time.png 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></a></p>
<p>Key observations:</p>
<ul>
<li>Execution time for all engines initially decreases as more cores and memory become available, but there is a clear <strong>"sweet spot"</strong> beyond which performance plateaus or even degrades slightly. Throwing more infrastructure at a problem does not guarantee faster results and can actually make things slower.</li>
<li>At comparable CU levels (e.g. 4 CUs per second), DuckDB and Polars on Python Notebooks significantly outperform PySpark on Spark Notebooks.</li>
</ul>
<p>This "sweet spot" behaviour is consistent with what we know about how DuckDB and Polars are engineered. Both tools use automatic parallelisation across available cores, but there is an overhead to thread coordination that grows with core count. Beyond the point where all available parallelism is fully utilised, adding more cores yields diminishing returns.</p>
<h2 id="cu-cost-analysis">CU Cost Analysis</h2>
<p>Shifting focus from execution time to cost, the following table lists the 10 cheapest engine/configuration combinations:</p>
<table>
<thead>
<tr>
<th style="text-align: left;">Workload</th>
<th style="text-align: left;">Configuration</th>
<th style="text-align: right;">CUs Per Second</th>
<th style="text-align: right;">Median Execution Time</th>
<th style="text-align: right;">Total Cost (CUs)</th>
<th style="text-align: right;">Percentage of Min Cost</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left;">duckdb_benchmark</td>
<td style="text-align: left;">02 vCores</td>
<td style="text-align: right;">1</td>
<td style="text-align: right;">107.871</td>
<td style="text-align: right;">107.871</td>
<td style="text-align: right;">100</td>
</tr>
<tr>
<td style="text-align: left;">duckdb_benchmark</td>
<td style="text-align: left;">04 vCores</td>
<td style="text-align: right;">2</td>
<td style="text-align: right;">66.405</td>
<td style="text-align: right;">132.81</td>
<td style="text-align: right;">123.1</td>
</tr>
<tr>
<td style="text-align: left;">polars_benchmark</td>
<td style="text-align: left;">02 vCores</td>
<td style="text-align: right;">1</td>
<td style="text-align: right;">156.885</td>
<td style="text-align: right;">156.885</td>
<td style="text-align: right;">145.4</td>
</tr>
<tr>
<td style="text-align: left;">duckdb_benchmark</td>
<td style="text-align: left;">08 vCores</td>
<td style="text-align: right;">4</td>
<td style="text-align: right;">47.037</td>
<td style="text-align: right;">188.148</td>
<td style="text-align: right;">174.4</td>
</tr>
<tr>
<td style="text-align: left;">polars_benchmark</td>
<td style="text-align: left;">04 vCores</td>
<td style="text-align: right;">2</td>
<td style="text-align: right;">102.398</td>
<td style="text-align: right;">204.796</td>
<td style="text-align: right;">189.9</td>
</tr>
<tr>
<td style="text-align: left;">pandas_benchmark</td>
<td style="text-align: left;">02 vCores</td>
<td style="text-align: right;">1</td>
<td style="text-align: right;">250.84</td>
<td style="text-align: right;">250.84</td>
<td style="text-align: right;">232.5</td>
</tr>
<tr>
<td style="text-align: left;">polars_benchmark</td>
<td style="text-align: left;">08 vCores</td>
<td style="text-align: right;">4</td>
<td style="text-align: right;">88.203</td>
<td style="text-align: right;">352.812</td>
<td style="text-align: right;">327.1</td>
</tr>
<tr>
<td style="text-align: left;">pandas_benchmark</td>
<td style="text-align: left;">04 vCores</td>
<td style="text-align: right;">2</td>
<td style="text-align: right;">224.122</td>
<td style="text-align: right;">448.244</td>
<td style="text-align: right;">415.5</td>
</tr>
<tr>
<td style="text-align: left;">duckdb_benchmark</td>
<td style="text-align: left;">16 vCores</td>
<td style="text-align: right;">8</td>
<td style="text-align: right;">72.794</td>
<td style="text-align: right;">582.352</td>
<td style="text-align: right;">539.9</td>
</tr>
<tr>
<td style="text-align: left;">pyspark_benchmark</td>
<td style="text-align: left;">01 executors 04/04 cores 28g/28g memory</td>
<td style="text-align: right;">4</td>
<td style="text-align: right;">149.619</td>
<td style="text-align: right;">598.478</td>
<td style="text-align: right;">554.8</td>
</tr>
</tbody>
</table>
<p>The cheapest Spark run costs more than 5x the cheapest DuckDB run and approximately 4x the cheapest Polars run. Even Pandas on the default minimal Python Notebook is cheaper than the most economical Spark configuration.</p>
<p><a target="_blank" href="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/03/fabric-notebook-costs.png"><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/03/fabric-notebook-costs.png" alt="Chart displaying the relative costs of different engine and configuration combinations, highlighting the cost-effectiveness of Python notebooks." title="Chart displaying the relative costs of different engine and configuration combinations, highlighting the cost-effectiveness of Python notebooks." srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2026/03/fabric-notebook-costs.png 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2026/03/fabric-notebook-costs.png 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2026/03/fabric-notebook-costs.png 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2026/03/fabric-notebook-costs.png 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></a></p>
<h2 id="execution-time-versus-cost">Execution Time versus Cost</h2>
<p>The following scatter chart provides a two-dimensional view of all engine/configuration combinations tested. The x-axis shows execution time; the y-axis shows cost on a logarithmic scale.</p>
<p>The sweet spot (best combination of fast execution and low cost) lies in the <strong>bottom-left quadrant</strong>. That space is dominated by DuckDB and Polars.</p>
<p><a target="_blank" href="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/03/fabric-notebook-costs-versus-execution-time.png"><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/03/fabric-notebook-costs-versus-execution-time.png" alt="Scatter plot of execution time versus cost, showing the 'sweet spot' dominated by DuckDB and Polars in the bottom-left quadrant." title="Scatter plot of execution time versus cost, showing the &quot;sweet spot&quot; dominated by DuckDB and Polars in the bottom-left quadrant." srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2026/03/fabric-notebook-costs-versus-execution-time.png 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2026/03/fabric-notebook-costs-versus-execution-time.png 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2026/03/fabric-notebook-costs-versus-execution-time.png 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2026/03/fabric-notebook-costs-versus-execution-time.png 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></a></p>
<h2 id="stage-analysis-what-separates-duckdb-from-polars">Stage Analysis: What Separates DuckDB from Polars?</h2>
<p>Both DuckDB and Polars are modern in-process query engines built for exactly this kind of workload, yet DuckDB consistently comes out ahead. The stage-level analysis reveals where that advantage is won.</p>
<p>DuckDB's edge appears consistently in the stages that involve <strong>reading from the Fabric Lakehouse</strong> — specifically, reading Delta format tables. DuckDB uses its own native Delta extension for this purpose, reading directly from OneLake without additional dependencies. Polars, by contrast, currently uses the <code>delta-rs</code> package for Delta reads.</p>
<p>This is an important distinction in the context of Microsoft Fabric's architecture. OneLake stores data in Delta format, and any engine that can query Delta tables natively has a structural advantage. It is worth watching whether Polars develops its own native Delta reader over time; if it does, the gap between the two engines may narrow.</p>
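<p>If you want to observe the difference on your own tables, a rough comparison is straightforward to set up. The snippet below is illustrative only: the path is a placeholder, the Polars authentication shown earlier is omitted for brevity, and a simple wall-clock timer is no substitute for the stage-level instrumentation in the benchmark repository.</p>
<pre><code class="language-python">import time
import duckdb
import polars as pl

table_path = (
    "abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/"
    "MyLakehouse.Lakehouse/Tables/fact_sales"
)

# DuckDB's native delta extension.
con = duckdb.connect()
con.execute("INSTALL delta")
con.execute("LOAD delta")
start = time.perf_counter()
con.sql(f"SELECT county, sum(price) FROM delta_scan('{table_path}') GROUP BY county").fetchall()
print("duckdb delta_scan:", time.perf_counter() - start)

# Polars via delta-rs.
start = time.perf_counter()
pl.scan_delta(table_path).group_by("county").agg(pl.col("price").sum()).collect()
print("polars scan_delta:", time.perf_counter() - start)
</code></pre>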
<p><a target="_blank" href="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/03/fabric-notebook-execution-stage-cumulative-time.png"><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/03/fabric-notebook-execution-stage-cumulative-time.png" alt="Stage-level analysis comparing DuckDB and Polars, revealing DuckDB's advantage in reading from the Fabric lakehouse." title="Stage-level analysis comparing DuckDB and Polars, revealing DuckDB's advantage in reading from the Fabric lakehouse." srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2026/03/fabric-notebook-execution-stage-cumulative-time.png 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2026/03/fabric-notebook-execution-stage-cumulative-time.png 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2026/03/fabric-notebook-execution-stage-cumulative-time.png 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2026/03/fabric-notebook-execution-stage-cumulative-time.png 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></a></p>
<h2 id="conclusions">Conclusions</h2>
<p>This benchmarking study provides concrete, Fabric-specific evidence for what we have been arguing across the broader series: in-process analytics engines running on single-node infrastructure are a serious and, for most workloads, superior alternative to distributed Spark.</p>
<ol>
<li><p><strong>Modern in-process engines outperform Spark for medium-scale workloads</strong> — DuckDB and Polars delivered faster execution times than PySpark across all comparable configurations, often by a factor of 2x or more. The claims we made in our <a href="https://endjin.com/blog/duckdb-rise-of-in-process-analytics-understanding-data-singularity">DuckDB</a> and <a href="https://endjin.com/blog/polars-faster-pipelines-simpler-infrastructure-happier-engineers">Polars</a> series hold up under real Fabric workloads.</p>
</li>
<li><p><strong>Cost efficiency strongly favours Python Notebooks with modern engines</strong> — the cheapest Spark configuration costs 4-5x more than the cheapest DuckDB run for equivalent work. For teams with finite Fabric capacity budgets, this is a material consideration. Profile your workload, start small, and scale to Spark only when you genuinely need it.</p>
</li>
<li><p><strong>More resources don't always mean faster execution</strong> — there is a "sweet spot" for resource allocation beyond which performance plateaus or degrades. This challenges the intuition that scaling infrastructure will proportionally improve throughput. Both DuckDB and Polars are efficient enough that 4-8 vCores often delivers the best balance of speed and cost.</p>
</li>
<li><p><strong>Default configurations are not equal — and the gap matters</strong> — the default Python Notebook (2 vCores) is 8x cheaper per second to run than the default Spark Notebook, yet delivers comparable elapsed-time performance when using DuckDB or Polars. For development workloads, the Python Notebook should be the default choice.</p>
</li>
<li><p><strong>DuckDB's native Delta reader provides a measurable edge</strong> — the stage analysis suggests DuckDB's advantage over Polars comes primarily from its native Delta reading capability. This is a meaningful finding in the context of Fabric, where Delta is the dominant table format. It reinforces one of DuckDB's core design principles: eliminate friction wherever data meets compute.</p>
</li>
<li><p><strong>Microsoft is responding to the in-process analytics movement</strong> — the introduction of Python Notebooks pre-installed with DuckDB and Polars, combined with OneLake support in both tools, signals that the platform is evolving to accommodate this shift. Teams investing in these tools are aligned with the direction of travel, not swimming against it.</p>
</li>
</ol>
<p>Note - all code used to generate this benchmark and the supporting analysis is <a href="https://github.com/endjin/fabric-performance-benchmark">available in a public repo on GitHub</a>.</p>
<p><a class="github-repo-card" href="https://github.com/endjin/fabric-performance-benchmark" target="_blank" rel="noopener noreferrer" data-github-repo="endjin/fabric-performance-benchmark"><span class="github-repo-card__row"><svg class="github-repo-card__logo" aria-hidden="true" viewBox="0 0 16 16" width="24" height="24" fill="currentColor"><path d="M8 0C3.58 0 0 3.58 0 8c0 3.54 2.29 6.53 5.47 7.59.4.07.55-.17.55-.38 0-.19-.01-.82-.01-1.49-2.01.37-2.53-.49-2.69-.94-.09-.23-.48-.94-.82-1.13-.28-.15-.68-.52-.01-.53.63-.01 1.08.58 1.23.82.72 1.21 1.87.87 2.33.66.07-.52.28-.87.51-1.07-1.78-.2-3.64-.89-3.64-3.95 0-.87.31-1.59.82-2.15-.08-.2-.36-1.02.08-2.12 0 0 .67-.21 2.2.82.64-.18 1.32-.27 2-.27.68 0 1.36.09 2 .27 1.53-1.04 2.2-.82 2.2-.82.44 1.1.16 1.92.08 2.12.51.56.82 1.27.82 2.15 0 3.07-1.87 3.75-3.65 3.95.29.25.54.73.54 1.48 0 1.07-.01 1.93-.01 2.2 0 .21.15.46.55.38A8.013 8.013 0 0 0 16 8c0-4.42-3.58-8-8-8z"></path></svg><span class="github-repo-card__content"><span class="github-repo-card__title">endjin/fabric-performance-benchmark</span><span class="github-repo-card__description" data-field="description" hidden=""></span><span class="github-repo-card__meta"><span class="github-repo-card__language" data-field="language" hidden=""><span class="github-repo-card__lang-dot"></span><span data-field="language-name"></span></span><span class="github-repo-card__stars" data-field="stars" hidden=""><svg aria-hidden="true" viewBox="0 0 16 16" width="16" height="16" fill="currentColor"><path d="M8 .25a.75.75 0 0 1 .673.418l1.882 3.815 4.21.612a.75.75 0 0 1 .416 1.279l-3.046 2.97.719 4.192a.75.75 0 0 1-1.088.791L8 12.347l-3.766 1.98a.75.75 0 0 1-1.088-.79l.72-4.194L.818 6.374a.75.75 0 0 1 .416-1.28l4.21-.611L7.327.668A.75.75 0 0 1 8 .25z"></path></svg><span data-field="stars-count"></span></span><span class="github-repo-card__forks" data-field="forks" hidden=""><svg aria-hidden="true" viewBox="0 0 16 16" width="16" height="16" fill="currentColor"><path d="M5 5.372v.878c0 .414.336.75.75.75h4.5a.75.75 0 0 0 .75-.75v-.878a2.25 2.25 0 1 1 1.5 0v.878a2.25 2.25 0 0 1-2.25 2.25h-1.5v2.128a2.251 2.251 0 1 1-1.5 0V8.5h-1.5A2.25 2.25 0 0 1 3.5 6.25v-.878a2.25 2.25 0 1 1 1.5 0zM5 3.25a.75.75 0 1 0-1.5 0 .75.75 0 0 0 1.5 0zm6.75.75a.75.75 0 1 0 0-1.5.75.75 0 0 0 0 1.5zm-3 8.75a.75.75 0 1 0-1.5 0 .75.75 0 0 0 1.5 0z"></path></svg><span data-field="forks-count"></span></span></span></span><img class="github-repo-card__avatar" src="https://github.com/endjin.png?size=120" alt="endjin avatar" loading="lazy" referrerpolicy="no-referrer"></span></a></p>
<p>The data singularity is not a theoretical future state — it is arriving now, and platforms like Microsoft Fabric are starting to reflect that. For most enterprise analytical workloads, the question is no longer whether single-node in-process engines can handle the job. Based on the evidence here, they can deliver a faster and cheaper solution than the distributed alternative.</p>
<p>So which workloads do still justify Spark's overhead?  Reserve Spark for datasets that genuinely exceed single-node memory capacity, or for workloads where Fabric-specific Spark optimisations (V-ORDER, Liquid Clustering) are demonstrably valuable. For everything else, DuckDB and Polars on a Python Notebook are the pragmatic choice.</p>
<p>The good news is that Microsoft Fabric makes all of these compute options available to you, underpinned by Delta format and <a href="https://learn.microsoft.com/en-us/fabric/onelake/onelake-overview">OneLake</a> as the common storage layer. You are not forced to make a choice up front; you have the flexibility to adapt without moving data or adopting a different platform.</p>]]></content:encoded>
    </item>
    <item>
      <title>Medallion Architecture in Excel</title>
      <description>Apply the Medallion Architecture to Excel: use the three-tab rule to separate raw data, logic, and output for cleaner, maintainable spreadsheets.</description>
      <link>https://endjin.com/blog/medallion-architecture-in-excel</link>
      <guid isPermaLink="true">https://endjin.com/blog/medallion-architecture-in-excel</guid>
      <pubDate>Tue, 21 Apr 2026 05:30:00 GMT</pubDate>
      <category>Data</category>
      <category>Risk</category>
      <category>Reporting</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/03/medallion-architecture-in-excel.png" />
      <dc:creator>James Broome</dc:creator>
      <content:encoded><![CDATA[<h2 id="the-three-tab-rule">The three-tab rule</h2>
<p>I've worked in technology for a long time. In fact, I've had a 25-year long career in software and data, which means I've had to use a lot of different programming languages, frameworks, platforms and tech stacks. But it's also meant that I've used Excel a lot. I would describe myself as fairly competent in Excel. Not <a href="https://www.bbc.co.uk/news/articles/cj4qzgvxxgvo">world championship level</a> by any means, and probably not even Joel Spolsky level (although I did <a href="https://www.youtube.com/watch?v=JxBg4sMusIg">start to suck slightly less after watching this 10 years ago</a>). But fairly competent.</p>
<p>However, I came across an interesting article recently that really grabbed my attention. <a href="https://www.howtogeek.com/microsoft-excel-3-tab-rule-structure-spreadsheet-like-a-software-developer/">This post on How-To Geek describes a "three-tab rule" in Excel</a>, separating source data, logic, and presentation. It's framed by making comparisons to the Model-View-Controller (MVC) pattern in web development, something I'm very familiar with, but had never considered applying to Excel. I was pretty surprised that I'd never come across this before (especially as I've spent a lot of time thinking about engineering practices in software), but I was also immediately struck by how useful, sensible and simple it was.</p>
<p>The pattern described proposes structuring Excel workbooks using three distinct tabs: Source, Logic, and Interface. The premise is straightforward - most spreadsheets fail because they mix raw data, calculations, and final reports on the same screen. By separating these concerns, you create workbooks that are easier to audit, maintain, and scale.</p>
<p>The MVC-based pattern works like this: the Source tab holds your raw, unmodified data in a structured format (ideally as an Excel Table). The Logic tab does all the heavy lifting - transformations, calculations, lookups, using modern Excel functions like FILTER, SORT, and LAMBDA. The Interface tab presents the final, polished output that stakeholders actually see.</p>
<p>This resonates strongly with how we think about data architecture at endjin. But while the original article frames this through the Model-View-Controller pattern, I think there's a more relevant mental model from the data engineering world that fits this use case perfectly.</p>
<h2 id="from-mvc-to-medallion">From MVC to Medallion</h2>
<p>If you've worked with modern data platforms - whether that's Databricks, Microsoft Fabric, or Azure Synapse - you'll likely be familiar with the Medallion Architecture. As Carmel described in her <a href="https://endjin.com/blog/2025/05/what-is-the-medallion-architecture">recent deep-dive on the topic</a>, it's a data design pattern that consists of three tiers: Bronze (raw), Silver (cleaned and validated), and Gold (projected for specific use cases).</p>
<p>The parallel to the three-tab Excel rule is obvious:</p>
<table>
<thead>
<tr>
<th>Medallion Tier</th>
<th>Excel Tab</th>
<th>Purpose</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Bronze</strong></td>
<td>Source</td>
<td>Raw data in its original form. No transformations, no cosmetic formatting. A historical archive of what was received.</td>
</tr>
<tr>
<td><strong>Silver</strong></td>
<td>Clean</td>
<td>Cleaned, validated, and structured data. Schema standardisation. This is where raw data becomes useful information.</td>
</tr>
<tr>
<td><strong>Gold</strong></td>
<td>Output</td>
<td>Transformed and projected for a specific use case. Logic and calculations applied. Polished, formatted, and ready for consumption by stakeholders. This might be a pivot table, a chart, or another targeted, tabular presentation.</td>
</tr>
</tbody>
</table>
<p>This isn't just a semantic rebrand. The Medallion Architecture brings with it a wealth of best practices from the data engineering community - around data quality, validation, lineage, and governance that can inform how you approach your data workloads, even in Excel.</p>
<h2 id="the-benefits-of-this-approach">The benefits of this approach</h2>
<p>When you start thinking about your Excel workbook as a miniature data pipeline, several good practices naturally follow.</p>
<p><strong>Data lineage becomes visible.</strong> With clear separation, you can trace any value in your Output tab back through the Clean tab to its origin in the Source tab. When a number looks wrong, you know where to look.</p>
<p><strong>Validation can be added systematically.</strong> The Clean tab becomes the natural place to add data quality checks - are there unexpected blanks, or duplicates? Do totals reconcile? Are values within expected ranges? This is the equivalent of creating quality gates as data moves from Bronze to Silver.</p>
<p><strong>The Source tab becomes immutable.</strong> Just as the Bronze tier in a data lakehouse preserves raw data for historical playback, your Source tab should remain untouched. If new data arrives, append it rather than overwrite. This gives you an audit trail and the ability to reprocess if your logic changes.</p>
<p><strong>Multiple projections from the same source.</strong> You might need different views of the same underlying data - one for the finance team, one for operations, one for the board. In the Medallion Architecture, this is exactly what the Gold tier enables. In Excel, you can create multiple Output tabs, all drawing from the same Clean layer.</p>
<h2 id="proceed-with-caution">Proceed with caution</h2>
<p>I should be clear that whilst this pattern makes Excel workbooks more robust, more maintainable, and more professional, it doesn't make Excel the right tool for every job. <a href="https://endjin.com/blog/2020/10/the-public-health-england-october-2020-test-and-trace-excel-error-could-have-been-prevented-by-this-one-simple-step">I hold some strong opinions about when and where Excel is appropriate</a>, and when the stakes are high - when significant decisions are being made and errors could have financial, regulatory, or even public health consequences - Excel isn't enough. You more than likely need proper software engineering discipline and quality control barriers across technology, process, and people. This pattern doesn't prevent someone accidentally filtering and deleting rows in the Source tab. It doesn't stop formulas breaking silently. It doesn't provide version control, automated testing, or audit logs.</p>
<p>But, for those situations where Excel genuinely is appropriate, or where constraints mean it's your only option, this approach represents a genuine step-change in how you structure your work.</p>
<h2 id="summary">Summary</h2>
<p>The three-tab rule for Excel is a pattern worth adopting. It maps directly onto the well-established Medallion Architecture, connecting your Excel work to the broader principles of modern data engineering: separation of concerns, data quality, lineage, and fit-for-purpose projections.</p>
<p>Of course, this approach doesn't come without limitations. Excel remains a tool designed for individual productivity, not enterprise data management. If your workbook is becoming mission-critical - if it updates regularly, if others depend on it, if errors would be costly, then it's worth asking whether you've outgrown what Excel can safely provide.</p>]]></content:encoded>
    </item>
    <item>
      <title>LINQ Max and nullable value types</title>
      <description>LINQ's projecting Max operator has a trap for the unwary when used with value types. Understand what goes wrong, and how to avoid it.</description>
      <link>https://endjin.com/blog/csharp-linq-max-nullable-values</link>
      <guid isPermaLink="true">https://endjin.com/blog/csharp-linq-max-nullable-values</guid>
      <pubDate>Fri, 17 Apr 2026 04:30:35 GMT</pubDate>
      <category>Nullable Reference Types</category>
      <category>Nullable Reference Types in C#</category>
      <category>Nullable Types in C#</category>
      <category>C# Nullable Types</category>
      <category>Nullable</category>
      <category>Null Reference Exceptions</category>
      <category>Nullable Values</category>
      <category>NRTs</category>
      <category>NRT</category>
      <category>non-nullable</category>
      <category>C#</category>
      <category>CSharp</category>
      <category>dotnet</category>
      <category>Visual Studio</category>
      <category>Visual Studio Code</category>
      <category>C# Tutorials</category>
      <category>C# Programming</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/03/csharp-linq-max-nullable-values.png" />
      <dc:creator>Ian Griffiths</dc:creator>
      <content:encoded><![CDATA[<p>While working on a project for a customer, we came across a slight oddity of LINQ's <a href="https://learn.microsoft.com/dotnet/api/system.linq.enumerable.max"><code>Max</code></a> operator when you use it with a value type. In some cases <code>Max</code> returns <code>null</code> when supplied with an empty list, and there are cases where this works even with value types—<a href="https://learn.microsoft.com/en-gb/dotnet/api/system.linq.enumerable.max?view=net-10.0#system-linq-enumerable-max-1(system-collections-generic-ienumerable((-0))-system-func((-0-system-nullable((system-int32)))))">this overload</a> returns <code>int?</code> for example. But in some cases it will not do this, and will instead throw an exception if its input is empty. The reasons behind it are non-obvious and somewhat subtle, so I thought I'd write about it.</p>
<p>The <code>Max</code> operator offers a <a href="https://learn.microsoft.com/dotnet/api/system.linq.enumerable.max#system-linq-enumerable-max-2(system-collections-generic-ienumerable((-0))-system-func((-0-1)))">projection-based overload</a> with this signature:</p>
<pre><code class="language-cs">public static TResult? Max&lt;TSource,TResult&gt;(
    this IEnumerable&lt;TSource&gt; source,
    Func&lt;TSource,TResult&gt; selector);
</code></pre>
<p>This will iterate through the <code>source</code>, pass each item to the <code>selector</code> callback, and then return the highest of the values the callback returns.</p>
<p>Notice how although the selector function returns a <code>TResult</code>, the return type of <code>Max</code> itself is <code>TResult?</code>. That nullability is there to handle the case where the <code>source</code> enumerable is empty: in that case there is no maximum value (because there are no values at all) and that <code>TResult?</code> return type means <code>Max</code> can return <code>null</code> to indicate that.</p>
<p>But it goes a bit weird if the <code>selector</code> returns a value type. Suppose you've got this type (which is a reference type, but crucially, two of its properties use value types):</p>
<pre><code class="language-cs">public record WithValues(string Label, int Number, DateTimeOffset Date);
</code></pre>
<p>First, let's verify that <code>Max</code> does what I've said with an empty list when the projection retrieves a reference type:</p>
<pre><code class="language-cs">WithValues[] empty = [];
string? maxLabel = empty.Max(x =&gt; x.Label);
Console.WriteLine(maxLabel is null);
</code></pre>
<p>This prints out <code>True</code>, confirming that <code>Max</code> here does indeed return <code>null</code> to let us know that there was no maximum value. (The notion of a <em>maximum string value</em> raises the awkward fact that <code>Max</code> doesn't let you pass an <code>IComparer&lt;T&gt;</code> here, but let's ignore that for now.)</p>
<p>With that in mind, what type do you suppose <code>maxDate</code> has in this example?</p>
<pre><code class="language-cs">WithValues[] empty = [];
var maxDate = empty.Max(d =&gt; d.Date);
</code></pre>
<p>If you look at the definition of <code>Max</code> you could correctly conclude that <code>TSource</code> here becomes <code>WithValues</code> and that <code>TResult</code> is <code>DateTimeOffset</code>. (And as we're doing all this type inference in our heads, we might reflect on whether using <code>var</code> here has really saved us any time and effort.) And since <code>Max</code> returns <code>TResult?</code> you might conclude that <code>maxDate</code> must be of type <code>DateTimeOffset?</code> (which is an alias for <code>Nullable&lt;DateTimeOffset&gt;</code>).</p>
<p>But that would be wrong. Here's exactly equivalent code using an explicit type declaration instead of <code>var</code>:</p>
<pre><code class="language-cs">WithValues[] empty = [];
DateTimeOffset maxDate = empty.Max(d =&gt; d.Date);
</code></pre>
<p>It is now clear that <code>maxDate</code>'s type is <code>DateTimeOffset</code>. If we were to try to declare it as a <code>DateTimeOffset?</code>, that would actually compile, but it wouldn't be equivalent to the <code>var</code> example: in the case where we use <code>var</code>, <code>maxDate</code> really does have the non-nullable <code>DateTimeOffset</code> type.</p>
<p>And if we do try to use <code>DateTimeOffset?</code>, it goes wrong. This compiles:</p>
<pre><code class="language-cs">DateTimeOffset? maxDate = empty.Max(d =&gt; d.Date);
if (maxDate.HasValue)
{
    Console.WriteLine(maxDate.Value);
}
else
{
    Console.WriteLine("No dates found.");
}
</code></pre>
<p>but it only compiles without error because an implicit conversion is available from <code>Max</code>'s return type of <code>DateTimeOffset</code> to the variable's type of <code>DateTimeOffset?</code>.</p>
<p>The most important thing to know about this code is that it will actually fail at runtime with an <code>InvalidOperationException</code> complaining that the <code>Sequence contains no elements</code>!</p>
<p>Earlier I linked to a non-generic overload of <code>Max</code> that returns an <code>int?</code> so you might think that this would work:</p>
<pre><code class="language-cs">WithValues[] empty = [];
int? maxNumber = empty.Max(x =&gt; x.Number);
</code></pre>
<p>but this will also fail with an exception at runtime instead of returning <code>null</code>. In fact it ends up using a different overload that returns an <code>int</code>, and not the one that returns an <code>int?</code>.</p>
<p>So that's weird. The first two examples call the same single overload of <code>Max</code>, and yet it handles an empty list completely differently depending on whether our selector returns the <code>Label</code> or <code>Date</code>. (When it returns <code>Number</code>, we end up using the <code>int</code>-specific overload, but the fact remains that an empty list causes an exception when we select a value-typed property, whereas the method returns <code>null</code> when selecting a reference-typed property.) What's going on?</p>
<p>Well it turns out that this particular <code>Max</code> method actually has two different code paths, and it effectively uses this test to choose which path to use:</p>
<pre><code class="language-cs">TResult val = default;
if (val == null)
...
</code></pre>
<p>If <code>val == null</code>, then it goes down the code path that returns <code>null</code> if the list is empty. If not, it goes down the path that throws an exception if the list is empty.</p>
<p>This is a deliberate design choice. If <code>default(TResult)</code> is something other than <code>null</code>—e.g. <code>default(int)</code> is 0—then there might be no way to tell the difference between an empty list, and a list where <code>default(TResult)</code> really was the maximum value. For example, in the list <code>[-3,-2,-1,0]</code>, the maximum value is <code>0</code>, so how could we distinguish between that case and the empty list case if we were getting back <code>0</code> in either case?</p>
<p>So there's a rationale for this behaviour, but it's not obvious that this one method can behave in two quite different ways. The documentation doesn't mention that this particular overload may throw an <code>InvalidOperationException</code>.</p>
<p>We can explore what that test will do with various types:</p>
<pre><code class="language-cs">static void ShowNull&lt;T&gt;()
{
    T? val = default;
    Console.WriteLine(val == null);
}

ShowNull&lt;string&gt;();
ShowNull&lt;string?&gt;();
ShowNull&lt;int&gt;();
ShowNull&lt;DateTimeOffset&gt;();
ShowNull&lt;int?&gt;();
ShowNull&lt;DateTimeOffset?&gt;();
</code></pre>
<p>This prints out:</p>
<pre><code>True
True
False
False
True
True
</code></pre>
<p>So this tells us that <code>Max</code> will consider <code>TResult</code> to be potentially nullable if it's a reference type like <code>string</code>, or if it's a nullable value type like <code>int?</code> or <code>DateTimeOffset?</code>. But plain value types like <code>int</code> and <code>DateTimeOffset</code> are considered not to be nullable.</p>
<p>That explains why using the <code>x =&gt; x.Label</code> lambda makes <code>Max</code> return <code>null</code> when the list is empty, while with <code>d =&gt; d.Date</code> or <code>d =&gt; d.Number</code>, it throws an exception. The first has a return type of <code>string</code> (a reference type) while the other two have non-nullable value-typed return types (<code>DateTimeOffset</code> and <code>int</code>).</p>
<p>But why does <code>Max</code> even have these two different code paths? It's perfectly possible for a method to return a <code>DateTimeOffset?</code>, so why does <code>Max</code> not do that here? If the argument for the type parameter <code>TResult</code> is <code>DateTimeOffset</code>, and the method declares a return type of <code>TResult?</code>, shouldn't that make the return type <code>DateTimeOffset?</code>?</p>
<p>The reason it doesn't work out that way is because nullability handling for reference types was a bit of an afterthought in C#. (See my extensive <a href="https://endjin.com/blog/2020/04/dotnet-csharp-8-nullable-references-non-nullable-is-the-new-default">series on nullable reference types for (a lot) more detail</a>.)</p>
<p>In the beginning (C# 1.0) there were value types, which could not be <code>null</code>, and reference types, which were always capable of being <code>null</code>. There simply wasn't any concept of a value type being nullable, and nor was there any way to constrain a reference type to be non-null. This reflected the underlying reality of the .NET runtime's type system. Then C# 2.0 added support for nullable value types, enabling us to write <code>int?</code>. But this was an entirely different way of representing nullability: an <code>int?</code> is really a <code>Nullable&lt;int&gt;</code>, and <code>Nullable&lt;T&gt;</code> essentially combines a value with a <code>bool</code> indicating whether the value is present. This is fundamentally different from how reference types like <code>string</code> represent <code>null</code>. (This is more of a library feature than a runtime feature. OK, strictly speaking there's some special handling for <code>Nullable&lt;T&gt;</code> when it comes to boxing and unboxing, but otherwise, this is mainly a language feature that doesn't directly reflect how the underlying runtime type system really works.) Although C# lets us write code that works with <code>int?</code> in ways that are (sometimes) similar to how we might work with a reference, the generated code is really quite different, and that causes challenges for generic code. And finally, C# 8.0 introduced nullability annotations for reference types, so that now, we write <code>string?</code> if we mean a reference that might be <code>null</code> whereas <code>string</code> is (in theory) never null, in a way that is conceptually similar to the fact that an <code>int</code> can never be null.</p>
<p>But although we've ended up in a place where there are apparently two dimensions—value vs reference, and nullable vs non-nullable—the history of how we got here means these aren't truly independent. Nullability works very differently for values vs references in practice. And two of the four combinations (nullable values, and non-nullable references) aren't really first class citizens in the .NET type system. (A nullable value in a null state looks different from the <code>null</code> reference value. And a 'non-nullable' reference might in fact be <code>null</code>.)</p>
<p>And this difference tends to poke out from time to time with surprising behaviour like we're seeing with this <code>Max</code> operator. It would be completely reasonable to expect it to deal with the <code>Label</code> and <code>Date</code> properties in exactly the same way. But the history of nullability in .NET means it doesn't work in practice.</p>
<p>So how do we fix this? We can use this slightly ugly hack:</p>
<pre><code class="language-cs">DateTimeOffset? maxDate = empty.Max(d =&gt; (DateTimeOffset?)d.Date);
</code></pre>
<p>That cast means that the lambda's type is now <code>Func&lt;WithValues, DateTimeOffset?&gt;</code>. (Before it was <code>Func&lt;WithValues, DateTimeOffset&gt;</code>, with a non-nullable return type.) Since <code>default(DateTimeOffset?) == null</code>, <code>Max</code> will select the code path that returns <code>null</code> when the input collection is empty. (It doesn't do that without this cast, because <code>default(DateTimeOffset)</code> is not <code>null</code>. It's a value representing midnight on the 1st January in the year 1, with a zero time zone offset.)</p>
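<p>Putting that together, here's a small sketch of both behaviours on an empty collection. (The <code>WithValues</code> record here is a stand-in for the type used earlier in this post, so the property names are assumed.)</p>
<pre><code class="language-cs">WithValues[] empty = [];

// Non-nullable selector: default(DateTimeOffset) isn't null, so Max throws
// InvalidOperationException ("Sequence contains no elements") for an empty source.
// DateTimeOffset wouldThrow = empty.Max(d =&gt; d.Date);

// Nullable selector: default(DateTimeOffset?) is null, so Max returns null instead.
DateTimeOffset? maxDate = empty.Max(d =&gt; (DateTimeOffset?)d.Date);
Console.WriteLine(maxDate is null); // True

// Stand-in for the WithValues type used earlier in this post (property names assumed).
record WithValues(string Label, DateTimeOffset Date, int Number);
</code></pre>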
<p>But what about that specialized (non-generic) overload of <code>Max</code> I linked to earlier that returns an <code>int?</code>? Well it turns out that it only comes into play when the selector also returns an <code>int?</code>. You get a different overload when the selector returns a plain <code>int</code>. So it ends up looking similar to the <code>DateTimeOffset</code> case (which used the generic overload). We need to cast to a nullable value:</p>
<pre><code class="language-cs">int? maxNumber = empty.Max(x =&gt; (int?)x.Number);
</code></pre>
<p>So we can make it work the way we want; it's just slightly messy. That's the reality of a 25+ year old language that has made two major changes to the nature of what it means to be <code>null</code>.</p>]]></content:encoded>
    </item>
    <item>
      <title>Returning to Work After a Career Break: How Remote Work Made It Possible</title>
      <description>After years away, I returned to work in the UK. Here's how remote flexibility protected my mental health and made that transition possible!</description>
      <link>https://endjin.com/blog/returning-to-work-after-a-career-break</link>
      <guid isPermaLink="true">https://endjin.com/blog/returning-to-work-after-a-career-break</guid>
      <pubDate>Thu, 16 Apr 2026 05:30:00 GMT</pubDate>
      <category>Remote</category>
      <category>Remote Working</category>
      <category>Wellbeing</category>
      <category>Career</category>
      <category>Digital Nomad</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/04/returning-to-work-after-a-career-break.png" />
      <dc:creator>Carmel Eve</dc:creator>
      <content:encoded><![CDATA[<p>In 2021 I left work, the UK, and most of my friends and family behind. After what felt like half a decade trapped (at points literally) in a small flat in Manchester, I couldn't wait to get out and explore the world. It was an amazing, life-changing experience. But, after some years away, it was time to return home. And, in doing so, I had some decisions to make.</p>
<p>When I first arrived at Gatwick airport I had no idea what life was going to look like - I stayed with a friend, found some temporary work, and tried to reacclimatise to life at home. There were many reunions, tears, and more than one panic attack due to complete overwhelm - reverse culture shock is a real thing, who knew?</p>
<p>But, when all was said and done, though I made some vague attempts to see what was out there in the working world, it felt like there was really only one place I wanted to end up - and that was back at endjin.</p>
<p>I've now been back in the UK nearly two years, and back at work for 18 months, and there is definitely a lot to reflect on.</p>
<h2 id="remote-work-and-digital-nomading">Remote Work and Digital Nomading</h2>
<p>I know "digital nomad" can sound a bit buzzword-y, but it is the simplest label for how I have been living since I got back.</p>
<p>Having spent almost 3 years without staying in one place more than a month, the idea of signing a year-long lease felt terrifying in the extreme. And, endjin being a totally remote company (and having been so since 2018 - before it was cool...), I luckily didn't have to.</p>
<p>A lot of Airbnbs offer discounts for stays over a month, and with the current price of electricity and gas, you can often find options for far less than you'd pay at a standard rental - especially if (like me) you are drawn to places in the middle of nowhere, even in the depths of winter. So, that is what I did...</p>
<p>I lived all over the UK - Devon, Yorkshire, Bristol, North Wales, and even spent the winter in Spain (something that convinced me more than ever that January in the UK just isn't for me). And, in all of this, I learnt a lot about what I value in the places that I live. The feeling of being able to walk out the door into nature is something that, for me, is unparalleled. That being said, somewhere with train connections allowing me to attend the 5+ weddings I need to go to (yes, I am in my 30s) is equally important. Two months spent in a depression fog in a village in Yorkshire taught me that access to a gym or at least <em>some</em> way to exercise when it's raining is also a must for my mental health...</p>
<p>Honestly, I do not think returning to work would have been possible for me at that point without this flexibility. My mental state was not great, and trying to go straight from years of constant movement into a rigid routine would have been too big a shock. Being able to choose where I lived, reduce pressure where I could, and make changes gradually meant I could build back up rather than burn out.</p>
<p>Remote-first work gave me the space to re-enter life in the UK on my own terms, and that flexibility is what has made the first year back at work feel in some ways like a continuation of the adventure.</p>
<h2 id="what-i-want-to-carry-forward">What I Want to Carry Forward</h2>
<p>I know that not everyone has the opportunity, means, or even desire to live month-to-month unpacking and re-packing, but I do think that there are some lessons that I've learnt which are applicable whatever your situation:</p>
<ul>
<li>Over-planning the next 1/3/5 years can make you more anxious, not less. One of the biggest things I had to accept when travelling was that nothing ever goes fully to plan. That's as true for life in general as it is for catching 4 buses in a day. Spending all my time running through every possible scenario and outcome is never as helpful as it feels in the moment. (As someone with anxiety, I know that's easier said than done. And, to be clear, I'm not saying don't plan at all - I'm told <em>some</em> financial planning is probably a good idea...)</li>
<li>You don't need to work everything out at once. Trying something and deciding it doesn't work is far better than never changing at all. Most of the time, all you need to plan is the next step.</li>
<li>Notice what makes you happy. Nothing makes me feel better than being in nature, so building a life around that feels not only logical, but necessary.</li>
<li>Also notice what drains you. For me, winter has always been hard. Once I accepted that, it became much easier to make decisions that were actually good for me.</li>
<li>Revisit your priorities every few months. What mattered to you last year might not be what you need now.</li>
<li>Don't confuse discomfort with failure. Some uncertainty is just part of change, and it does not always mean you've made the wrong decision.</li>
</ul>
<p>And, if you are considering stepping into the world of "digital nomading", some advice:</p>
<ul>
<li>If you are working, make sure that you have enough time to appreciate a place. The first couple of places I stayed, I was only there for a month. By the time I'd moved the first weekend and left the final one, I felt like I had barely found my feet before I was moving again. Plan a good margin between moves - back-to-back travel plus work can be exhausting very quickly.</li>
<li>Think about the practical - what are some of the things that you do every day / every week that you'd struggle without - a swimming pool? A gym? A library? A pub within walking distance..?</li>
<li>How much travel do you need to do whilst you are there? Can you find somewhere you love that doesn't mean spending 5 hours on a train multiple times per month?</li>
<li>Always have a backup internet option. A local SIM/hotspot can save a lot of panic on work days.</li>
<li>Budget for comfort, not just cost. Sometimes paying a bit more for location, heating, or a proper desk is worth it.</li>
<li>Remember that it can be lonely. Moving around means you don't always build a base where you are living. Make sure you know what you will do if you are feeling alone - are there friends nearby? Do you have people you can call? Are there local groups you can get involved in?</li>
</ul>
<h2 id="final-thought">Final thought</h2>
<p>The biggest thing I have learned is that you do not need to blow up your whole life to make things better. You can use the same approach wherever you are: pay attention to what helps, be honest about what drains you, and focus on the next sensible step instead of waiting for a perfect long-term plan. That shift has made life feel less overwhelming, work feel far more sustainable, and I'm excited about building a life that works for me - whatever that might look like!</p>]]></content:encoded>
    </item>
    <item>
      <title>AI-assisted coding is four decisions, not one</title>
      <description>A simple mental model for making sense of the AI-assisted coding landscape: four layers, four decisions.</description>
      <link>https://endjin.com/blog/ai-assisted-coding-is-four-decisions-not-one</link>
      <guid isPermaLink="true">https://endjin.com/blog/ai-assisted-coding-is-four-decisions-not-one</guid>
      <pubDate>Mon, 13 Apr 2026 23:08:00 GMT</pubDate>
      <category>AI</category>
      <category>GenAI</category>
      <category>Claude Code</category>
      <category>GitHub Copilot</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/03/ai-assisted-coding-is-four-decisions-not-one.png" />
      <dc:creator>Mike Evans-Larah</dc:creator>
      <content:encoded><![CDATA[<p>The pace of change in the world of AI-assisted coding is overwhelming. With new tools, frameworks, and platforms emerging and evolving constantly, it can be hard to keep up, let alone understand how all the pieces fit together.</p>
<p>People often ask questions like "Should I use Cursor or ChatGPT?" or "Is Claude better than Copilot?" but to make decisions about which tools to use (or even ask the right sort of questions), it's important to understand the underlying architecture.</p>
<p>In this post, I want to share a simple mental model that has helped me make sense of the AI-assisted coding landscape.</p>
<h2 id="the-four-layers">The four layers</h2>
<p>At a high level, we can think of AI-assisted coding as being composed of four distinct layers:</p>
<ol>
<li><strong>Harness</strong>: The user interface and experience for interacting with the AI. It includes things like code editors, chat interfaces, CLI tools, and the system prompts that shape the AI's behaviour.</li>
<li><strong>Capabilities</strong>: The tools, instructions, skills, and context sources that extend what the AI can do — increasingly portable across harnesses.</li>
<li><strong>Model</strong>: The AI model that processes input and generates tokens — text, code, images, or other outputs. This is where the "intelligence" lives.</li>
<li><strong>Provider</strong>: The infrastructure and services that host and run the model. This includes cloud platforms, APIs, and the computational resources (GPUs, memory, state) needed to power it.</li>
</ol>
<p>These layers build on each other: the harness provides the interface, capabilities extend what's possible, the model does the reasoning, and the provider supplies the compute.</p>
<h3 id="harness">Harness</h3>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/03/ai-assisted-coding-harness.png" alt="Harness illustration" srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2026/03/ai-assisted-coding-harness.png 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2026/03/ai-assisted-coding-harness.png 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2026/03/ai-assisted-coding-harness.png 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2026/03/ai-assisted-coding-harness.png 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<p>The harness is what you actually interact with day-to-day, and there's a surprisingly wide range of options. Broadly, they fall into a few categories:</p>
<ul>
<li><strong>Chat interfaces</strong>: Web-based or desktop tools like claude.ai or ChatGPT, where you paste code in and get responses back. Great for quick questions and exploration, but limited when it comes to working with full projects.</li>
<li><strong>IDE extensions</strong>: Tools like GitHub Copilot Chat or Roo Code that plug into your existing editor (VS Code, Visual Studio, JetBrains, etc.). These meet you where you already work, with direct access to your codebase.</li>
<li><strong>Purpose-built IDEs</strong>: Editors like Cursor and Google Antigravity that have been built from the ground up with AI at their core (usually forks of VS Code). They offer deep integration between the editor experience and the AI capabilities.</li>
<li><strong>App builders</strong>: Tools like Lovable that focus on generating entire applications from natural language descriptions, targeting less technical users or rapid prototyping.</li>
<li><strong>CLI tools</strong>: Command-line agents like Claude Code, OpenCode, and Copilot CLI that let you work with AI directly from your terminal. These tend to appeal to developers who prefer keyboard-driven workflows.</li>
</ul>
<p>Even within these categories, there has been a push recently towards different user interactions - for example, integrating voice control or continuing conversations across devices.</p>
<p>But the harness isn't just about where you interact with the AI; it's also about how it shapes the AI's behaviour. Modern harnesses go well beyond simple chat, adding capabilities such as:</p>
<ul>
<li><strong>Agentic workflows</strong>: The ability for the AI to plan, execute multi-step tasks, spawn sub-agents, run commands, and iterate on its own output. This might happen locally in the harness, or it might hand off to cloud-based agents running on the provider layer.</li>
<li><strong>System prompts</strong>: The invisible instructions that shape how the AI behaves. This is a bigger deal than most people realise. The same model can perform dramatically differently depending on the harness it's running in, because each harness ships its own system prompt.</li>
<li><strong>Memory and context</strong>: Persistent memory across sessions, project-level instructions, and the ability to pull in relevant files and documentation automatically.</li>
</ul>
<p>These harness-level capabilities can make a huge difference to your productivity, often more so than the choice of model itself. A great model in a limited harness won't perform as well as a good model in a harness that gives it the right tools and context.</p>
<h3 id="capabilities">Capabilities</h3>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/03/ai-assisted-coding-capabilities.png" alt="Capabilities illustration" srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2026/03/ai-assisted-coding-capabilities.png 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2026/03/ai-assisted-coding-capabilities.png 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2026/03/ai-assisted-coding-capabilities.png 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2026/03/ai-assisted-coding-capabilities.png 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<p>The capabilities layer is what you layer on top of the harness to extend what the AI can do. This has arguably been one of the biggest developments in the last year. It includes:</p>
<ul>
<li><strong>Tools and MCP</strong>: The Model Context Protocol (MCP) has emerged as a standard way to give AI access to external tools — running tests, querying databases, calling APIs, searching the web, interacting with design tools, and more. These tools are increasingly portable: the same MCP server can work across Claude Code, GitHub Copilot, Cursor, and other harnesses.</li>
<li><strong>Instructions and skills</strong>: Project-level instruction files (like <code>.instructions.md</code> or <code>.cursorrules</code>) that tell the AI about your codebase conventions, preferred patterns, and how to approach tasks. Custom skills and agent definitions let you package domain-specific knowledge that the AI can draw on.</li>
<li><strong>Context sources</strong>: Documentation, codebase indexing, knowledge bases, and other reference material that help the AI understand your specific domain and codebase.</li>
</ul>
<p>What makes capabilities a distinct layer (rather than just a feature of the harness) is their portability. You can take the same MCP servers, the same instruction files, and in many cases the same context sources, and use them across different harnesses. Your investment in configuring capabilities isn't locked to a single tool.</p>
<h3 id="model">Model</h3>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/03/ai-assisted-coding-model.png" alt="Model illustration" srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2026/03/ai-assisted-coding-model.png 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2026/03/ai-assisted-coding-model.png 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2026/03/ai-assisted-coding-model.png 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2026/03/ai-assisted-coding-model.png 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<p>The model is where the "thinking" happens. When you send a prompt, it's the model that interprets your intent, reasons about the problem, and generates tokens in response — whether that's code, prose, or increasingly other modalities like images and audio. Models differ across several key dimensions:</p>
<ul>
<li><strong>Reasoning ability</strong>: How well the model can break down complex problems, plan multi-step solutions, and handle nuanced logic. The emergence of dedicated reasoning modes (like "extended thinking") has been a significant step forward here, though higher reasoning levels consume substantially more tokens.</li>
<li><strong>Code generation quality</strong>: The accuracy, correctness, and idiomatic quality of the code it produces across different languages and frameworks.</li>
<li><strong>Tool use</strong>: How reliably the model can decide when and how to call external tools provided by the harness and capabilities layer, and how well it can structure its output to work with those tools.</li>
<li><strong>Context window</strong>: How much text the model can "see" at once. Larger context windows mean the model can work with bigger codebases without losing track of important details.</li>
<li><strong>Speed</strong>: How quickly the model generates responses. For interactive coding, latency matters, so a slower but more capable model isn't always the best choice for every task.</li>
</ul>
<p>Today's landscape includes several categories of model:</p>
<p><strong>Frontier models</strong> - like Claude, GPT, and Gemini - are the most capable, hosted in the cloud, and accessed via API. They're constantly being updated and represent the cutting edge.</p>
<p><strong>Local models</strong> can run on your own hardware. Tools like Ollama or LM Studio make it straightforward to run open-weight models (e.g. Qwen, Llama, DeepSeek). They're typically less capable than frontier models, but they offer advantages in terms of privacy, cost (no per-token charges), and the ability to work offline. It's worth noting that "open-weight" doesn't always mean fully open — you can download and run the model, but you typically don't know how it was trained or on what data.</p>
<p><strong>Specialist models</strong> are smaller, narrowly focused models tuned for specific tasks: speech-to-text (e.g. Whisper), text-to-speech, classification, summarisation, OCR, and more. While frontier models are incredibly powerful, they're also expensive; for high-volume business tasks, smaller and more cost-efficient models often make more sense.</p>
<p>And you don't have to pick just one. Many harnesses let you switch models on the fly, so you can use a fast, lightweight model for simple tasks and a more powerful frontier model when you need heavy reasoning.</p>
<h3 id="provider">Provider</h3>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/03/ai-assisted-coding-provider.png" alt="Provider illustration" srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2026/03/ai-assisted-coding-provider.png 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2026/03/ai-assisted-coding-provider.png 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2026/03/ai-assisted-coding-provider.png 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2026/03/ai-assisted-coding-provider.png 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<p>The provider layer is often invisible to individual developers, but it's where many of the most important enterprise concerns live. When an organisation is evaluating AI coding tools, the questions at this layer tend to dominate the conversation:</p>
<ul>
<li><strong>Data residency</strong>: Where are your prompts and context being sent, and where are they processed? For regulated industries, data staying within a specific geographic region can be a hard requirement.</li>
<li><strong>Security and compliance</strong>: Does the provider meet the organisation's security standards? Are prompts and code snippets logged or used for training? What certifications does the provider hold (SOC 2, ISO 27001, etc.)?</li>
<li><strong>Copyright and IP</strong>: There are risks associated with generating code using models trained on public data, or agents retrieving proprietary information. Some providers offer guarantees around IP ownership and indemnity (such as Microsoft's Copilot Copyright Commitment), which can be crucial for commercial use.</li>
<li><strong>Rate limits and availability</strong>: How many requests can you make before hitting throttling? Is there an SLA for uptime? For a team of developers relying on AI throughout the day, rate limits can become a real bottleneck.</li>
<li><strong>Cost management</strong>: Pricing varies significantly, from flat-rate subscriptions to per-token usage billing. At scale, understanding and controlling costs becomes critical. This is compounded by the cost of reasoning: enabling higher reasoning levels on capable models can dramatically increase token consumption.</li>
</ul>
<p>This is where options like <strong>Microsoft Foundry</strong> and <strong>Amazon Bedrock</strong> come in. They let enterprises access frontier models through their existing cloud provider (though not always within the same infrastructure), with the governance, networking, and compliance controls they already have in place. You get your own dedicated capacity, and billing flows through your existing agreements. Model marketplaces like <strong>Hugging Face</strong> also play a role, providing a catalogue of models (both open-weight and commercial) that can be deployed on your own infrastructure.</p>
<p>For most individual developers, the provider layer is something you don't think about much — it just works. But for teams and organisations adopting AI coding tools at scale, it's often the layer that determines which tools are actually allowed to be used.</p>
<h3 id="bundled-vs.mix-and-match">Bundled vs. mix-and-match</h3>
<p>In practice, you'll see these layers packaged together in different ways. Some products bundle all layers tightly, while others give you the freedom to pick and choose.</p>
<p>For example, a <strong>Claude Pro subscription</strong> bundles everything: the claude.ai chat interface, Claude desktop app, and Claude Code CLI (harness), Anthropic's Claude models (model), and Anthropic's own infrastructure (provider). It's a clean, simple experience — but you're largely locked into Anthropic's choices at every layer.</p>
<p><strong>GitHub Copilot</strong> takes a more flexible approach. You get VS Code / Visual Studio IDE extensions (or Copilot CLI) as your harness, but you can choose from a wide selection of models: Claude, GPT, Gemini, and others. The models are hosted on different providers behind the scenes, but this is abstracted away. You can also bring your own capabilities via MCP servers and instruction files. You can even use alternative harnesses like the Claude SDK from within the Copilot ecosystem, or connect local models.</p>
<p>Then there are tools like <strong>OpenCode</strong> or <strong>Roo Code</strong>, which are open-source harnesses that let you bring your own model <em>and</em> your own provider. You could run a local model on your own hardware, connect to OpenAI with your own API key, or point it at an Azure OpenAI deployment your team manages. This gives maximum flexibility, but you're responsible for wiring it all up.</p>
<h3 id="blurring-boundaries">Blurring boundaries</h3>
<p>It's worth noting that the boundaries between these layers are starting to blur. Agentic capabilities that used to live purely in the harness are increasingly being pushed into the model and provider layers. Coding agents can run on your local device, in the cloud, or as a fleet of sub-agents. And you can orchestrate all of it from your IDE, a terminal, or even a mobile device. The mental model is still useful for making decisions, but the sharp lines between layers are softening as the ecosystem matures.</p>
<h2 id="conclusion">Conclusion</h2>
<p>This post isn't intended to recommend a specific tool or combination - what's right for you will depend on your constraints, your team's needs, and the kind of work you're doing. But by understanding that there are four distinct decision points - harness, capabilities, model, and provider - and the trade-offs at each layer, you can make informed choices rather than getting lost in the noise. When someone asks "Should I use Cursor or Claude?", you'll know that's not quite the right question, and you'll know what questions to ask instead.</p>]]></content:encoded>
    </item>
    <item>
      <title>Integration Testing Azure Functions Part 5: Reqnroll in Build Pipeline</title>
      <description>Integration testing Azure Functions with Reqnroll and C#. Part 5 covers running your Corvus.Testing specs in Azure DevOps and GitHub Actions pipelines.</description>
      <link>https://endjin.com/blog/integration-testing-azure-functions-with-reqnroll-and-csharp-part-5-using-corvus-testing-reqnroll-in-a-build-pipeline</link>
      <guid isPermaLink="true">https://endjin.com/blog/integration-testing-azure-functions-with-reqnroll-and-csharp-part-5-using-corvus-testing-reqnroll-in-a-build-pipeline</guid>
      <pubDate>Sat, 11 Apr 2026 06:35:00 GMT</pubDate>
      <category>Azure</category>
      <category>Azure Functions</category>
      <category>BDD</category>
      <category>Corvus</category>
      <category>Corvus.Testing.ReqnRoll</category>
      <category>Corvus.Testing</category>
      <category>Durable Functions</category>
      <category>Reqnroll</category>
      <category>Testing</category>
      <category>Integration Testing</category>
      <category>C#</category>
      <category>CSharp</category>
      <category>.NET</category>
      <category>dotnet</category>
      <category>Visual Studio</category>
      <category>Gherkin</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/03/integration-testing-azure-functions-with-reqnroll-and-csharp-part-5-using-corvus-testing-reqnroll-in-a-build-pipeline.png" />
      <dc:creator>Jonathan George</dc:creator>
      <content:encoded><![CDATA[<p><strong>TL;DR</strong> - This series of posts shows how you can integration test Azure Functions projects using the open-source <a href="https://github.com/corvus-dotnet/Corvus.Testing.AzureFunctions.ReqnRoll">Corvus.Testing.AzureFunctions.ReqnRoll</a> library and walks through the different ways you can use it in your Reqnroll projects to start and stop function app instances for your scenarios and features.</p>
<p>In the previous posts in this series, we introduced the Corvus.Testing.AzureFunctions.ReqnRoll project and showed how you can use the bindings and classes it provides to start functions apps as part of your scenarios and features. We're going to finish with some pointers on how to ensure these tests can run as part of your build pipelines.</p>
<p>Exactly what's needed depends on your build system, but it's generally straightforward to ensure that tests using these methods can run as part of the build pipeline.</p>
<p>If, like us, you're using Azure DevOps with hosted agents, you'll need to add a step to your pipeline to install the Azure Functions Core Tools. For YAML build definitions, it looks like this:</p>
<pre><code class="language-YAML">- task: Npm@1
  displayName: 'Install Azure Functions V4 Core Tools'
  inputs:
    command: custom
    verbose: false
    customCommand: 'install -g azure-functions-core-tools@4 --unsafe-perm true'
</code></pre>
<p>Once that step has run, the test will be able to execute as it does locally. The <code>Corvus.Testing.AzureFunctions.ReqnRoll</code> library targets .NET 8, which is cross-platform, so this should work on both Windows and Linux build agents.</p>
<p>GitHub Actions is similarly straightforward. Simply add a step to install the Azure Functions Core Tools towards the start of your pipeline.</p>
<p>This approach should also work for other CI servers and their hosted build agents. If you're using a private agent you have the option of installing the tools globally, meaning your build scripts can just assume they are present - this very much depends on how you prefer to manage build agents.</p>
<h2 id="summary">Summary</h2>
<p>For Reqnroll users, the techniques I've shown in these posts will help make your integration tests more complete by ensuring that the functions under test are hosted in a way that closely matches the Azure environment they will ultimately run in.</p>
<p>I've tried to keep the posts simple by only covering testing HTTP triggered functions, but these techniques can equally be used to test functions that use other trigger types too. In our projects we've used it to test functions with blob, queue and Event Hubs endpoints, as well as functions using Durable extensions. This will generally require other infrastructure for your integration testing - for example, <a href="https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azurite">Azurite</a> and/or <a href="https://testcontainers.com/">Testcontainers</a> to make storage available to trigger blob and queue endpoints - but the principles remain the same.</p>
<p>As mentioned above, the Corvus.Testing projects (<a href="https://github.com/corvus-dotnet/Corvus.Testing.ReqnRoll">Corvus.Testing.ReqnRoll</a> and <a href="https://github.com/corvus-dotnet/Corvus.Testing.AzureFunctions.ReqnRoll">Corvus.Testing.AzureFunctions.ReqnRoll</a>) are open source and contributions are accepted. If you encounter problems with them, please feel free to raise an Issue - and if you're able, submit a pull request. And if you have any questions or feedback, just ask!</p>
<aside class="series-toc" aria-label="Series table of contents">
    <div class="series-toc__header">
        <h3 class="series-toc__title">Integration Testing Azure Functions with Reqnroll &amp; C#</h3>
        <span class="series-toc__count">5 parts</span>
    </div>
    <ol class="series-toc__list">
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/integration-testing-azure-functions-with-reqnroll-and-csharp-part-1-introduction" class="series-toc__link">
                    <span class="series-toc__part-number">1.</span>
                    <span class="series-toc__part-title">Introduction</span>
                </a>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/integration-testing-azure-functions-with-reqnroll-and-csharp-part-2-using-step-bindings-to-start-functions" class="series-toc__link">
                    <span class="series-toc__part-number">2.</span>
                    <span class="series-toc__part-title">Using Step Bindings to Start Functions</span>
                </a>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/integration-testing-azure-functions-with-reqnroll-and-csharp-part-3-using-hooks-to-start-functions" class="series-toc__link">
                    <span class="series-toc__part-number">3.</span>
                    <span class="series-toc__part-title">Using Hooks to Start Functions</span>
                </a>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/integration-testing-azure-functions-with-reqnroll-and-csharp-part-4-controlling-your-functions-with-additional-configuration" class="series-toc__link">
                    <span class="series-toc__part-number">4.</span>
                    <span class="series-toc__part-title">Controlling Functions with Configuration</span>
                </a>
            </li>
            <li class="series-toc__item series-toc__item--current" aria-current="page">
                <span class="series-toc__part-number">5.</span>
                <span class="series-toc__part-title">Using Corvus.Testing.ReqnRoll in a Build Pipeline</span>
                <span class="series-toc__current-label">(you are here)</span>
            </li>
    </ol>
</aside>]]></content:encoded>
    </item>
    <item>
      <title>Integration Testing Azure Functions Part 4: Reqnroll Configuration</title>
      <description>Integration testing Azure Functions with Reqnroll and C#. Part 4 shows how to supply or override configuration values for the functions apps under test.</description>
      <link>https://endjin.com/blog/integration-testing-azure-functions-with-reqnroll-and-csharp-part-4-controlling-your-functions-with-additional-configuration</link>
      <guid isPermaLink="true">https://endjin.com/blog/integration-testing-azure-functions-with-reqnroll-and-csharp-part-4-controlling-your-functions-with-additional-configuration</guid>
      <pubDate>Sat, 11 Apr 2026 06:34:00 GMT</pubDate>
      <category>Azure</category>
      <category>Azure Functions</category>
      <category>BDD</category>
      <category>Corvus</category>
      <category>Corvus.Testing.ReqnRoll</category>
      <category>Corvus.Testing</category>
      <category>Durable Functions</category>
      <category>Reqnroll</category>
      <category>Testing</category>
      <category>Integration Testing</category>
      <category>C#</category>
      <category>CSharp</category>
      <category>.NET</category>
      <category>dotnet</category>
      <category>Visual Studio</category>
      <category>Gherkin</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/03/integration-testing-azure-functions-with-reqnroll-and-csharp-part-4-controlling-your-functions-with-additional-configuration.png" />
      <dc:creator>Jonathan George</dc:creator>
      <content:encoded><![CDATA[<p><strong>TL;DR</strong> - This series of posts shows how you can integration test Azure Functions projects using the open-source <a href="https://github.com/corvus-dotnet/Corvus.Testing.AzureFunctions.ReqnRoll">Corvus.Testing.AzureFunctions.ReqnRoll</a> library and walks through the different ways you can use it in your Reqnroll projects to start and stop function app instances for your scenarios and features.</p>
<p>In the previous posts in this series, we introduced the Corvus.Testing.AzureFunctions.ReqnRoll project and showed how you can use the bindings and classes it provides to start functions apps as part of your scenarios and features. Here, we look at how you can vary the behaviour of those functions apps by providing or overriding configuration values.</p>
<p>When the <code>FunctionsController</code> is used to start a new function, it will check the <code>ScenarioContext</code> (if available) and <code>FeatureContext</code> for an instance of the <code>FunctionConfiguration</code> class. Any configuration provided here will be made available to the functions app when it starts.</p>
<p>If you're using step bindings to start your function, you can do this by writing an additional step binding to provide configuration from wherever you need to retrieve it. Note that this step must come before the one that starts the function. You can see an example in <code>ScenariosUsingStepBindings.feature</code> in the demo project; a sketch of the idea is shown below.</p>
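<p>As a rough illustration (the step text, setting name and values here are made up, and the <code>FunctionConfiguration</code> usage is assumed from the demo project rather than copied from it), such a binding might look something like this:</p>
<pre><code class="language-cs">using Corvus.Testing.AzureFunctions;
using Reqnroll;

[Binding]
public class AdditionalConfigurationSteps
{
    private readonly ScenarioContext scenarioContext;

    public AdditionalConfigurationSteps(ScenarioContext scenarioContext)
    {
        this.scenarioContext = scenarioContext;
    }

    [Given("the function is configured with a response message of '(.*)'")]
    public void GivenTheFunctionIsConfigured(string message)
    {
        // Assumed API: FunctionConfiguration exposes an EnvironmentVariables dictionary
        // that is passed to the functions host when the FunctionsController starts it.
        var configuration = new FunctionConfiguration();
        configuration.EnvironmentVariables.Add("ResponseMessage", message);

        // Store it in the ScenarioContext so the controller finds it when the
        // function-starting step runs later in the scenario.
        this.scenarioContext.Set(configuration);
    }
}
</code></pre>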
<p>If you're using a <code>BeforeScenario</code> or <code>BeforeFeature</code> hook, you can add the configuration at the same time - as shown in <code>ScenariosUsingPerScenarioHookWithAdditionalConfiguration.feature</code> (and the corresponding hook method in <code>DemoFunctionPerScenario</code>, <code>StartFunctionWithAdditionalConfigurationAsync</code>), as well as the per-feature equivalents.</p>
<p>If you need to use different configuration at different times, you can create separate hook methods for setting up configuration - just ensure you use the <code>Order</code> parameter on the hook attributes to ensure the configuration is set prior to the functions being started.</p>
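<p>For example (the tag, ordering values and setting are illustrative rather than taken from the demo project), the configuration hook just needs a lower <code>Order</code> than the hook that starts the function:</p>
<pre><code class="language-cs">using System.Threading.Tasks;
using Corvus.Testing.AzureFunctions;
using Reqnroll;

[Binding]
public class OrderedConfigurationHooks
{
    // Runs first: set up the configuration the function will need.
    [BeforeScenario("usingDemoFunction", Order = 1)]
    public static void AddFunctionConfiguration(ScenarioContext scenarioContext)
    {
        var configuration = new FunctionConfiguration(); // assumed API, as in the sketch above
        configuration.EnvironmentVariables.Add("SomeSetting", "some-value");
        scenarioContext.Set(configuration);
    }

    // Runs second: start the function, which picks up the configuration stored above.
    [BeforeScenario("usingDemoFunction", Order = 2)]
    public static Task StartFunctionAsync(ScenarioContext scenarioContext)
    {
        // Start the function via the FunctionsController here, as shown in the previous post.
        return Task.CompletedTask;
    }
}
</code></pre>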
<a href="https://endjin.com/blog/integration-testing-azure-functions-with-reqnroll-and-csharp-part-5-using-corvus-testing-reqnroll-in-a-build-pipeline">In the next (and final) post in the series, we'll cover how to ensure the tests you've written using the techniques covered in these posts can run in your build pipelines.</a>
<aside class="series-toc" aria-label="Series table of contents">
    <div class="series-toc__header">
        <h3 class="series-toc__title">Integration Testing Azure Functions with Reqnroll &amp; C#</h3>
        <span class="series-toc__count">5 parts</span>
    </div>
    <ol class="series-toc__list">
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/integration-testing-azure-functions-with-reqnroll-and-csharp-part-1-introduction" class="series-toc__link">
                    <span class="series-toc__part-number">1.</span>
                    <span class="series-toc__part-title">Introduction</span>
                </a>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/integration-testing-azure-functions-with-reqnroll-and-csharp-part-2-using-step-bindings-to-start-functions" class="series-toc__link">
                    <span class="series-toc__part-number">2.</span>
                    <span class="series-toc__part-title">Using Step Bindings to Start Functions</span>
                </a>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/integration-testing-azure-functions-with-reqnroll-and-csharp-part-3-using-hooks-to-start-functions" class="series-toc__link">
                    <span class="series-toc__part-number">3.</span>
                    <span class="series-toc__part-title">Using Hooks to Start Functions</span>
                </a>
            </li>
            <li class="series-toc__item series-toc__item--current" aria-current="page">
                <span class="series-toc__part-number">4.</span>
                <span class="series-toc__part-title">Controlling Functions with Configuration</span>
                <span class="series-toc__current-label">(you are here)</span>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/integration-testing-azure-functions-with-reqnroll-and-csharp-part-5-using-corvus-testing-reqnroll-in-a-build-pipeline" class="series-toc__link">
                    <span class="series-toc__part-number">5.</span>
                    <span class="series-toc__part-title">Using Corvus.Testing.ReqnRoll in a Build Pipeline</span>
                </a>
            </li>
    </ol>
</aside>]]></content:encoded>
    </item>
    <item>
      <title>Integration Testing Azure Functions Part 3: Reqnroll hooks</title>
      <description>Integration testing Azure Functions with Reqnroll and C#. Part 3 uses scenario and feature hooks to start functions apps and keep your BDD specs readable.</description>
      <link>https://endjin.com/blog/integration-testing-azure-functions-with-reqnroll-and-csharp-part-3-using-hooks-to-start-functions</link>
      <guid isPermaLink="true">https://endjin.com/blog/integration-testing-azure-functions-with-reqnroll-and-csharp-part-3-using-hooks-to-start-functions</guid>
      <pubDate>Sat, 11 Apr 2026 06:33:00 GMT</pubDate>
      <category>Azure</category>
      <category>Azure Functions</category>
      <category>BDD</category>
      <category>Corvus</category>
      <category>Corvus.Testing.ReqnRoll</category>
      <category>Corvus.Testing</category>
      <category>Durable Functions</category>
      <category>Reqnroll</category>
      <category>Testing</category>
      <category>Integration Testing</category>
      <category>C#</category>
      <category>CSharp</category>
      <category>.NET</category>
      <category>dotnet</category>
      <category>Visual Studio</category>
      <category>Gherkin</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/03/integration-testing-azure-functions-with-reqnroll-and-csharp-part-3-using-hooks-to-start-functions.png" />
      <dc:creator>Jonathan George</dc:creator>
      <content:encoded><![CDATA[<p><strong>TL;DR</strong> - This series of posts shows how you can integration test Azure Functions projects using the open-source <a href="https://github.com/corvus-dotnet/Corvus.Testing.AzureFunctions.ReqnRoll">Corvus.Testing.AzureFunctions.ReqnRoll</a> library and walks through the different ways you can use it in your Reqnroll projects to start and stop function app instances for your scenarios and features.</p>
<p>In the first two posts in this series, <a href="https://endjin.com/blog/integration-testing-azure-functions-with-reqnroll-and-csharp-part-1-introduction">we introduced the Corvus.Testing.AzureFunctions.ReqnRoll project</a> and <a href="https://endjin.com/blog/integration-testing-azure-functions-with-reqnroll-and-csharp-part-2-using-step-bindings-to-start-functions">showed how you can use the Reqnroll step bindings it provides to start functions apps</a> as part of your scenarios. This approach has the drawback of making your scenarios harder to read for non-technical users, so in this post, we're going to take a look at using scenario and feature hooks to address that problem.</p>
<h2 id="using-per-scenario-hooks-to-start-and-stop-functions-as-shown-in-scenariosusingperscenariohook.feature">Using per-scenario hooks to start and stop functions - as shown in <code>ScenariosUsingPerScenarioHook.feature</code></h2>
<p>Reqnroll hooks allow us to add code that's executed at specific points during a test run. With this method, we make use of the <code>BeforeScenario</code> and <code>AfterScenario</code> hooks, and use the <code>Corvus.Testing.AzureFunctions.FunctionsController</code> class directly to start and stop our functions. This can be seen in the <code>DemoFunctionPerScenario</code> class in the demo project; a simplified sketch of the approach is shown below.</p>
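<p>As a simplified sketch (the tag, project name, port and the exact <code>FunctionsController</code> construction are assumptions rather than a copy of the demo code), the hooks look something like this:</p>
<pre><code class="language-cs">using System.Threading.Tasks;
using Corvus.Testing.AzureFunctions;
using Microsoft.Extensions.Logging.Abstractions;
using Reqnroll;

[Binding]
public class DemoFunctionPerScenarioHooks
{
    [BeforeScenario("usingDemoFunction")]
    public static async Task StartFunctionAsync(ScenarioContext scenarioContext)
    {
        // Assumed construction: the real demo class wires the controller up with a proper logger.
        var functionsController = new FunctionsController(NullLogger.Instance);

        // Project name, port and runtime are illustrative values.
        await functionsController.StartFunctionsInstance(
            "Corvus.Testing.AzureFunctions.DemoFunction", 7075, "net8.0");

        // Store the controller so the AfterScenario hook can retrieve it for teardown.
        scenarioContext.Set(functionsController);
    }

    [AfterScenario("usingDemoFunction")]
    public static void StopFunction(ScenarioContext scenarioContext)
    {
        scenarioContext.Get&lt;FunctionsController&gt;().TeardownFunctions();
    }
}
</code></pre>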
<p>The parameters that the <code>StartFunctionsInstance</code> method takes mirror those of the step binding shown in the previous post, allowing you to specify project, port and runtime. You'll see that the created <code>FunctionsController</code> instance is stored in the <code>ScenarioContext</code>; this allows us to pull it out in the <code>AfterScenario</code> method (which you should add yourself, as shown in the test code) to tear down the functions.</p>
<h3 id="advantages-to-this-method">Advantages to this method</h3>
<p>Using this method conceals the technical detail of what the setup step involves, reducing it to a single tag for the function. This makes your scenarios much more readable. If you're writing lots of tests for a specific functions app, it also reduces the duplication needed when every scenario has to contain the setup step.</p>
<p>In addition, test output (and the associated functions output) can be viewed in exactly the same way as described in the previous post.</p>
<h3 id="disadvantages-to-this-method">Disadvantages to this method</h3>
<p>The main disadvantage to this approach is one that's associated with pretty much all integration testing: speed. Whilst setting up and tearing down all dependencies for each test is the gold standard, spinning up functions takes time and this can mean your test suite takes a long time to execute. In some scenarios this may be unavoidable. However, others may lend themselves to using the third method to strike a balance between speeding up execution and isolating tests.</p>
<h2 id="using-per-feature-hooks-to-start-and-stop-functions-as-shown-in-scenariosusingperfeaturehook.feature">Using per-feature hooks to start and stop functions - as shown in <code>ScenariosUsingPerFeatureHook.feature</code></h2>
<p>Visually, this approach looks extremely similar to the previous method. The scenario definitions are not affected at all, and the main difference in the underlying code (other than using <code>BeforeFeature</code> and <code>AfterFeature</code> attributes) is that the <code>FeatureContext</code> is used in place of the <code>ScenarioContext</code> to store and retrieve the <code>FunctionsController</code>. The other difference is that the hook methods themselves need to be static to be used with per-feature hooks - this is a Reqnroll requirement.</p>
<h3 id="advantages-to-this-method-1">Advantages to this method</h3>
<p>If you can group related scenarios and be sure they won't conflict with one another, this is a relatively easy way of speeding up test execution.</p>
<h3 id="disadvantages-to-this-method-1">Disadvantages to this method</h3>
<p>As implied above, this approach does have the potential to cause unexpected results if your tests conflict with one another in any way. The other disadvantage is that because the function output is gathered and written to console when the function is terminated, it can no longer be seen in the test output. If you don't mind duplication, you can get round this by adding an additional <code>AfterScenario</code> hook to write the function output to the console after every scenario.</p>
<a href="https://endjin.com/blog/integration-testing-azure-functions-with-reqnroll-and-csharp-part-4-controlling-your-functions-with-additional-configuration">In the next post, we'll show how you can provide additional configuration to functions apps started as part of tests.</a>
<aside class="series-toc" aria-label="Series table of contents">
    <div class="series-toc__header">
        <h3 class="series-toc__title">Integration Testing Azure Functions with Reqnroll &amp; C#</h3>
        <span class="series-toc__count">5 parts</span>
    </div>
    <ol class="series-toc__list">
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/integration-testing-azure-functions-with-reqnroll-and-csharp-part-1-introduction" class="series-toc__link">
                    <span class="series-toc__part-number">1.</span>
                    <span class="series-toc__part-title">Introduction</span>
                </a>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/integration-testing-azure-functions-with-reqnroll-and-csharp-part-2-using-step-bindings-to-start-functions" class="series-toc__link">
                    <span class="series-toc__part-number">2.</span>
                    <span class="series-toc__part-title">Using Step Bindings to Start Functions</span>
                </a>
            </li>
            <li class="series-toc__item series-toc__item--current" aria-current="page">
                <span class="series-toc__part-number">3.</span>
                <span class="series-toc__part-title">Using Hooks to Start Functions</span>
                <span class="series-toc__current-label">(you are here)</span>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/integration-testing-azure-functions-with-reqnroll-and-csharp-part-4-controlling-your-functions-with-additional-configuration" class="series-toc__link">
                    <span class="series-toc__part-number">4.</span>
                    <span class="series-toc__part-title">Controlling Functions with Configuration</span>
                </a>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/integration-testing-azure-functions-with-reqnroll-and-csharp-part-5-using-corvus-testing-reqnroll-in-a-build-pipeline" class="series-toc__link">
                    <span class="series-toc__part-number">5.</span>
                    <span class="series-toc__part-title">Using Corvus.Testing.ReqnRoll in a Build Pipeline</span>
                </a>
            </li>
    </ol>
</aside>]]></content:encoded>
    </item>
    <item>
      <title>Integration Testing Azure Functions Part 2: Reqnroll step bindings</title>
      <description>Integration testing Azure Functions with Reqnroll and C#. Part 2 uses Corvus.Testing step bindings to start and stop functions apps in your scenarios.</description>
      <link>https://endjin.com/blog/integration-testing-azure-functions-with-reqnroll-and-csharp-part-2-using-step-bindings-to-start-functions</link>
      <guid isPermaLink="true">https://endjin.com/blog/integration-testing-azure-functions-with-reqnroll-and-csharp-part-2-using-step-bindings-to-start-functions</guid>
      <pubDate>Sat, 11 Apr 2026 06:32:00 GMT</pubDate>
      <category>Azure</category>
      <category>Azure Functions</category>
      <category>BDD</category>
      <category>Corvus</category>
      <category>Corvus.Testing.ReqnRoll</category>
      <category>Corvus.Testing</category>
      <category>Durable Functions</category>
      <category>Reqnroll</category>
      <category>Testing</category>
      <category>Integration Testing</category>
      <category>C#</category>
      <category>CSharp</category>
      <category>.NET</category>
      <category>dotnet</category>
      <category>Visual Studio</category>
      <category>Gherkin</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/03/integration-testing-azure-functions-with-reqnroll-and-csharp-part-2-using-step-bindings-to-start-functions.png" />
      <dc:creator>Jonathan George</dc:creator>
      <content:encoded><![CDATA[<p><strong>TL;DR</strong> - This series of posts shows how you can integration test Azure Functions projects using the open-source <a href="https://github.com/corvus-dotnet/Corvus.Testing.AzureFunctions.ReqnRoll">Corvus.Testing.AzureFunctions.ReqnRoll</a> library and walks through the different ways you can use it in your Reqnroll projects to start and stop function app instances for your scenarios and features.</p>
<p>In the <a href="https://endjin.com/blog/integration-testing-azure-functions-with-reqnroll-and-csharp-part-1-introduction">first post in this series</a>, we introduced the <a href="https://github.com/corvus-dotnet/Corvus.Testing.AzureFunctions.ReqnRoll">Corvus.Testing.AzureFunctions.ReqnRoll</a> library. In this post, we're going to take a look at the simplest way of using it to start functions apps for testing purposes, which is to use the provided step bindings.</p>
<p>This is demonstrated in <a href="https://github.com/corvus-dotnet/Corvus.Testing.AzureFunctions.ReqnRoll/tree/main/Solutions/Corvus.Testing.AzureFunctions.ReqnRoll.Demo.Specs/AzureFunctionsTesting/ScenariosUsingStepBindings.feature"><code>ScenariosUsingStepBindings.feature</code></a> in the <a href="https://github.com/corvus-dotnet/Corvus.Testing.AzureFunctions.ReqnRoll/tree/main/Solutions/Corvus.Testing.AzureFunctions.ReqnRoll.Demo.Specs">Corvus.Testing.AzureFunctions.ReqnRoll.Demo.Specs</a> project.</p>
<p>The Corvus.Testing.AzureFunctions.ReqnRoll project contains a <code>Given</code> step definition for the following pattern:</p>
<pre><code class="language-gherkin">[Given("I start a functions instance for the local project '(.*)' on port (.*) with runtime '(.*)'")]
</code></pre>
<p>If you include steps that match this pattern in your scenario, they will cause the functions defined in the specified project to be run, with HTTP functions listening on the specified port. If your function doesn't actually have any HTTP endpoints you can supply a dummy value for the port. Runtime will most likely be <code>net8.0</code> (for Functions v4), though <code>net9.0</code> and <code>net10.0</code> are also options depending on your target framework.</p>
<p>The project to run is currently resolved by traversing up the folder tree until it finds a folder that, when combined with the function name, runtime and build folder, provides a valid path.</p>
<p>As well as bindings for these steps, there's an additional AfterScenario hook that goes with them to tear down the functions instances they start. You can start multiple functions in a single scenario using these bindings if necessary.</p>
<h3 id="viewing-function-output">Viewing function output</h3>
<p>Once the test run is complete, output from the functions app can be seen in the Test Detail Summary. In Visual Studio, this is visible in the Test Explorer by selecting the scenario that's been executed:</p>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2020/03/Test-explorer.png" alt="Test Explorer" srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2020/03/Test-explorer.png 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2020/03/Test-explorer.png 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2020/03/Test-explorer.png 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2020/03/Test-explorer.png 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<p>Clicking the link "Open additional output for this result" will show Reqnroll's standard output capture:</p>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2020/03/Test-output.png" alt="Test output window" srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2020/03/Test-output.png 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2020/03/Test-output.png 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2020/03/Test-output.png 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2020/03/Test-output.png 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<p>As you can see from the screenshot above, this starts with the output from the <code>BeforeScenario</code> binding showing the solution and runtime location. If starting the function failed for some reason, you'd most likely see the reason here.</p>
<p>This is followed by Reqnroll's standard per-step output. Finally, the output from the <code>AfterScenario</code> binding is shown, which is where the StdOut and StdErr for each function are added.</p>
<p>Note that the log shown in this window is frequently a truncated version of the whole. If this is the case, you'll see a message explaining how to access the full log by copying and pasting into another tool.</p>
<h3 id="advantages-to-this-method">Advantages to this method</h3>
<p>Using step bindings in this way makes it crystal clear to the developer what's going on as part of their spec. You can easily see which functions are being run and on what ports.</p>
<h3 id="disadvantages-to-this-method">Disadvantages to this method</h3>
<p>Whilst it's nice for developers to see exactly what technical setup is taking place, this does go against the goals of Behaviour Driven Development. Specifically, we should be striving to make the feature readable in the end user's language. When testing an API using a BDD spec, you can make a case that the end user whose language we should be using is a technical one - the consumers of APIs are most likely to be developers - but even so, this is an overly technical step to include in your scenarios.</p>
<a href="https://endjin.com/blog/integration-testing-azure-functions-with-reqnroll-and-csharp-part-3-using-hooks-to-start-functions">In the next post, I'll show how this problem can be addressed using Reqnroll hooks to start and stop functions apps.</a>
<aside class="series-toc" aria-label="Series table of contents">
    <div class="series-toc__header">
        <h3 class="series-toc__title">Integration Testing Azure Functions with Reqnroll &amp; C#</h3>
        <span class="series-toc__count">5 parts</span>
    </div>
    <ol class="series-toc__list">
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/integration-testing-azure-functions-with-reqnroll-and-csharp-part-1-introduction" class="series-toc__link">
                    <span class="series-toc__part-number">1.</span>
                    <span class="series-toc__part-title">Introduction</span>
                </a>
            </li>
            <li class="series-toc__item series-toc__item--current" aria-current="page">
                <span class="series-toc__part-number">2.</span>
                <span class="series-toc__part-title">Using Step Bindings to Start Functions</span>
                <span class="series-toc__current-label">(you are here)</span>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/integration-testing-azure-functions-with-reqnroll-and-csharp-part-3-using-hooks-to-start-functions" class="series-toc__link">
                    <span class="series-toc__part-number">3.</span>
                    <span class="series-toc__part-title">Using Hooks to Start Functions</span>
                </a>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/integration-testing-azure-functions-with-reqnroll-and-csharp-part-4-controlling-your-functions-with-additional-configuration" class="series-toc__link">
                    <span class="series-toc__part-number">4.</span>
                    <span class="series-toc__part-title">Controlling Functions with Configuration</span>
                </a>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/integration-testing-azure-functions-with-reqnroll-and-csharp-part-5-using-corvus-testing-reqnroll-in-a-build-pipeline" class="series-toc__link">
                    <span class="series-toc__part-number">5.</span>
                    <span class="series-toc__part-title">Using Corvus.Testing.ReqnRoll in a Build Pipeline</span>
                </a>
            </li>
    </ol>
</aside>]]></content:encoded>
    </item>
    <item>
      <title>Integration Testing Azure Functions with Reqnroll Part 1: Introduction</title>
      <description>Integration testing Azure Functions with Reqnroll and C#. Part 1 sets out the testing challenge and introduces the open-source Corvus.Testing library.</description>
      <link>https://endjin.com/blog/integration-testing-azure-functions-with-reqnroll-and-csharp-part-1-introduction</link>
      <guid isPermaLink="true">https://endjin.com/blog/integration-testing-azure-functions-with-reqnroll-and-csharp-part-1-introduction</guid>
      <pubDate>Sat, 11 Apr 2026 06:31:00 GMT</pubDate>
      <category>Azure</category>
      <category>Azure Functions</category>
      <category>BDD</category>
      <category>Corvus</category>
      <category>Corvus.Testing.ReqnRoll</category>
      <category>Corvus.Testing</category>
      <category>Durable Functions</category>
      <category>Reqnroll</category>
      <category>Testing</category>
      <category>Integration Testing</category>
      <category>C#</category>
      <category>CSharp</category>
      <category>.NET</category>
      <category>dotnet</category>
      <category>Visual Studio</category>
      <category>Gherkin</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/03/integration-testing-azure-functions-with-reqnroll-and-csharp-part-1-introduction.png" />
      <dc:creator>Jonathan George</dc:creator>
      <content:encoded><![CDATA[<p><strong>TL;DR</strong> - This series of posts shows how you can integration test Azure Functions projects using the open-source <a href="https://github.com/corvus-dotnet/Corvus.Testing.AzureFunctions.ReqnRoll">Corvus.Testing.AzureFunctions.ReqnRoll</a> library and walks through the different ways you can use it in your Reqnroll projects to start and stop function app instances for your scenarios and features.</p>
<p>If you use Azure Functions on a regular basis, you'll likely have grappled with the challenge of testing them. The testing story for functions is not hugely well defined. If you're building your functions well, then there won't be a lot of code in them - they will be thin facades calling into code that does the bulk of the work, in which case you will likely have used a standard unit testing approach on that code. Nevertheless, that likely leaves some functionality untested - for example, ensuring your models are correctly bound to input, and ensuring that correct status codes, headers and so on are returned from your requests.</p>
<p>As such, it becomes necessary to step up a level and look at how to test the functions as a whole. There are two options for this:</p>
<ul>
<li>You can test in-process, using <a href="https://techcommunity.microsoft.com/blog/fasttrackforazureblog/azure-functions---part-2---unit-and-integration-testing/3769764">the approach defined in Microsoft's docs</a> (note that this doesn't seem to have been updated to take account of functions that use instance methods, but that's unlikely to affect the approach). This is good, but doesn't ensure that your function is configured correctly, and if you're using automatic model binding, it doesn't test that this is working as you expect.</li>
<li>Alternatively, you can test out of process, either against a deployed instance of the function or against one that's running locally. Testing against a deployed instance is a great idea, but this is normally reserved for another level of testing, meant to ensure that things are working as expected in a deployed environment. It doesn't address the needs of the developer as the feedback loop from making a change to deploying a function to Azure is likely just too long. This leaves us with the challenge of testing against a function running locally.</li>
</ul>
<p>So, how do we go about this?</p>
<p>Before I continue I should note that while I'm specifically addressing how to do this with <a href="https://reqnroll.net/">Reqnroll</a>, a very similar approach can be taken with other frameworks. Reqnroll is the community-driven successor to SpecFlow, created by the original SpecFlow creator after SpecFlow reached end-of-life. If you're migrating from SpecFlow, the <a href="https://docs.reqnroll.net/latest/guides/migrating-from-specflow.html">Reqnroll migration guide</a> is a great place to start.</p>
<h2 id="goals">Goals</h2>
<p>As always, it's worth starting with what we want to achieve:</p>
<ol>
<li>We want a way of automatically starting a function, and then shutting it down once the test is completed.</li>
<li>We want this to work in as close a way as possible to a deployed function.</li>
<li>Ideally, we want to be able to capture the output from the function while it's running.</li>
<li>It's useful to be able to easily affect the configuration of the function under test.</li>
<li>We want an approach that can work as part of a CI pipeline.</li>
</ol>
<p>So, let's have a look at how we achieve these goals.</p>
<h2 id="running-the-function-locally">Running the function locally</h2>
<p>When you hit F5 to run a function in Visual Studio, it uses a copy of the Azure Functions Core Tools that's managed by Visual Studio. Normally they get automatically installed into <code>C:\Users\username\AppData\Local\AzureFunctionsTools\Releases</code> and Visual Studio selects the correct version to use based on your project's runtime.</p>
<p>However, this is an internal detail of how Visual Studio implements the Functions SDK, so it's not really something we can rely on. Fortunately <a href="https://docs.microsoft.com/en-us/azure/azure-functions/functions-run-local">you can install and use Azure Functions Core Tools directly</a>.</p>
<p>We recommend using Azure Functions v4 with the isolated worker model and .NET 8 or later. The isolated worker model is the recommended approach for new Azure Functions projects, and in-process support is scheduled to end in November 2026.</p>
<p>To get the tools installed, you have a few choices. If you're on Windows, you can use winget:</p>
<pre><code class="language-bash">winget install Microsoft.Azure.FunctionsCoreTools
</code></pre>
<p>or Chocolatey</p>
<pre><code class="language-bash">choco install azure-functions-core-tools
</code></pre>
<p>Otherwise, you'll need npm:</p>
<pre><code class="language-bash">npm i -g azure-functions-core-tools@4 --unsafe-perm true
</code></pre>
<p>This will install the tools locally - you can verify the installation by running the <code>func</code> command from the command prompt. Doing so lists all the things you can do with it - scaffolding new functions apps and functions, and running functions locally. The latter is what we're concerned with: you can start a function host using the command <code>func start</code>, providing the port number and other details as part of the command. This is what we're going to use when setting up our test.</p>
<h2 id="introducing-corvus.testing">Introducing Corvus.Testing</h2>
<p>The code to start, stop and manage functions as part of a Reqnroll test is part of the endjin-sponsored Corvus.Testing libraries. The original <a href="https://github.com/corvus-dotnet/Corvus.Testing">Corvus.Testing</a> repository has been split into separate, focused repos:</p>
<ul>
<li><a href="https://github.com/corvus-dotnet/Corvus.Testing.ReqnRoll">Corvus.Testing.ReqnRoll</a> — general Reqnroll testing utilities</li>
<li><a href="https://github.com/corvus-dotnet/Corvus.Testing.AzureFunctions.ReqnRoll">Corvus.Testing.AzureFunctions.ReqnRoll</a> — Azure Functions-specific testing classes and bindings</li>
</ul>
<p>The classes that we're interested in are part of <code>Corvus.Testing.AzureFunctions.ReqnRoll</code> and are:</p>
<p><a href="https://github.com/corvus-dotnet/Corvus.Testing.AzureFunctions.ReqnRoll/blob/main/Solutions/Corvus.Testing.AzureFunctions/Corvus/Testing/AzureFunctions/FunctionsController.cs"><strong>FunctionsController.cs</strong></a> - this contains methods to start a new functions instance, and to tear down all functions it manages. It's intended to live for the lifetime of the test as it captures the output and error streams from the function and write them all to the Console when the functions are terminated. When running in Reqnroll, this results in that information being written to the test's output.</p>
<p><a href="https://github.com/corvus-dotnet/Corvus.Testing.AzureFunctions.ReqnRoll/blob/main/Solutions/Corvus.Testing.AzureFunctions/Corvus/Testing/AzureFunctions/FunctionConfiguration.cs"><strong>FunctionConfiguration.cs</strong></a> - this is part of the mechanism by which the test project can provide settings to the function under test.</p>
<p><a href="https://github.com/corvus-dotnet/Corvus.Testing.AzureFunctions.ReqnRoll/blob/main/Solutions/Corvus.Testing.AzureFunctions.ReqnRoll/Corvus/Testing/AzureFunctions/ReqnRoll/FunctionsBindings.cs"><strong>FunctionsBindings.cs</strong></a> - this provides a couple of standard step bindings that can be used as part of a scenario to start a function.</p>
<p>This code is all open source, and contributions are accepted. It's available under the Apache 2.0 open source license meaning you're free to use and modify the code as you see fit. The license does impose some conditions around retaining copyright attributions and so on - <a href="https://www.apache.org/licenses/LICENSE-2.0">you can read the full details here</a>.</p>
<p>This code ticks the boxes for the first four of the five goals I set out above, providing mechanisms to keep functions running for the duration of test execution, as well as a way to supply additional configuration. The next few sections explain the different ways of using this.</p>
<p>I'll be doing this with reference to the demo projects that are part of the Corvus.Testing.AzureFunctions.ReqnRoll codebase. Before continuing, I recommend downloading the project so you can examine the code. There are two demo functions projects — <a href="https://github.com/corvus-dotnet/Corvus.Testing.AzureFunctions.ReqnRoll/tree/main/Solutions/Corvus.Testing.AzureFunctions.Demo.InProcess">Corvus.Testing.AzureFunctions.Demo.InProcess</a> for the in-process model and <a href="https://github.com/corvus-dotnet/Corvus.Testing.AzureFunctions.ReqnRoll/tree/main/Solutions/Corvus.Testing.AzureFunctions.Demo.Isolated">Corvus.Testing.AzureFunctions.Demo.Isolated</a> for the isolated worker model — that contain a slightly modified version of code that's generated when you create a new HTTP-triggered function in Visual Studio. They accept GET and POST requests, looking for a parameter called <code>name</code> in either the querystring or request body, and returning a configurable string containing that parameter.</p>
<p>It also contains a Reqnroll test project, <a href="https://github.com/corvus-dotnet/Corvus.Testing.AzureFunctions.ReqnRoll/tree/main/Solutions/Corvus.Testing.AzureFunctions.ReqnRoll.Demo.Specs">Corvus.Testing.AzureFunctions.ReqnRoll.Demo.Specs</a>, containing feature files which relate to the next few posts in this series.</p>
<a href="https://endjin.com/blog/integration-testing-azure-functions-with-reqnroll-and-csharp-part-2-using-step-bindings-to-start-functions">In the next post, I'll show you how you can add steps to your Reqnroll scenarios to run your functions apps.</a>
<aside class="series-toc" aria-label="Series table of contents">
    <div class="series-toc__header">
        <h3 class="series-toc__title">Integration Testing Azure Functions with Reqnroll &amp; C#</h3>
        <span class="series-toc__count">5 parts</span>
    </div>
    <ol class="series-toc__list">
            <li class="series-toc__item series-toc__item--current" aria-current="page">
                <span class="series-toc__part-number">1.</span>
                <span class="series-toc__part-title">Introduction</span>
                <span class="series-toc__current-label">(you are here)</span>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/integration-testing-azure-functions-with-reqnroll-and-csharp-part-2-using-step-bindings-to-start-functions" class="series-toc__link">
                    <span class="series-toc__part-number">2.</span>
                    <span class="series-toc__part-title">Using Step Bindings to Start Functions</span>
                </a>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/integration-testing-azure-functions-with-reqnroll-and-csharp-part-3-using-hooks-to-start-functions" class="series-toc__link">
                    <span class="series-toc__part-number">3.</span>
                    <span class="series-toc__part-title">Using Hooks to Start Functions</span>
                </a>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/integration-testing-azure-functions-with-reqnroll-and-csharp-part-4-controlling-your-functions-with-additional-configuration" class="series-toc__link">
                    <span class="series-toc__part-number">4.</span>
                    <span class="series-toc__part-title">Controlling Functions with Configuration</span>
                </a>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/integration-testing-azure-functions-with-reqnroll-and-csharp-part-5-using-corvus-testing-reqnroll-in-a-build-pipeline" class="series-toc__link">
                    <span class="series-toc__part-number">5.</span>
                    <span class="series-toc__part-title">Using Corvus.Testing.ReqnRoll in a Build Pipeline</span>
                </a>
            </li>
    </ol>
</aside>]]></content:encoded>
    </item>
    <item>
      <title>From Prompt Engineering to AI Programming: Enterprise GenAI Solutions</title>
      <description>Shift from prompt engineering to AI programming by applying rigorous software engineering principles to your LLM integrations.</description>
      <link>https://endjin.com/blog/programming-not-prompting</link>
      <guid isPermaLink="true">https://endjin.com/blog/programming-not-prompting</guid>
      <pubDate>Fri, 13 Mar 2026 05:30:00 GMT</pubDate>
      <category>GenAI</category>
      <category>Generative AI</category>
      <category>AI</category>
      <category>Machine Learning</category>
      <category>Software Engineering</category>
      <category>Engineering Discipline</category>
      <category>LLM</category>
      <category>Prompt Engineering</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/02/programming-not-prompting.png" />
      <dc:creator>James Broome</dc:creator>
      <content:encoded><![CDATA[<p>As organizations race to integrate generative AI into their business workflows, they are hitting a familiar challenge - the gap between a cool demo and a reliable enterprise solution (one that you'd be confident betting the business, or at the very least your reputation on). To bridge this, we must shift the mindset from prompt engineering to rigorous AI programming and systematic evaluation, just like with any other software engineering project.</p>
<h2 id="the-quality-challenge-in-the-age-of-ai">The quality challenge in the age of AI</h2>
<p>Over the last year or so we have seen the world move at 100mph, with AI integration into every product (whether it's a good fit or not!) and organisations eager to deploy LLMs and AI services into production in their own systems and workflows. With frontier models readily available through cloud APIs, and deep enterprise integration into big data platforms through wrapper/portal services like <a href="https://azure.microsoft.com/en-us/products/ai-foundry">Microsoft Foundry</a>, it's very easy to get started and get excited about what's possible.</p>
<p>This is exacerbated by the fact that AI is now pretty much ubiquitous for consumers of digital products. If business stakeholders are used to being able to get results instantly from native LLM app interfaces on their devices, then expectations are high right from the outset.</p>
<p>But there's a big gap between a PoC and a working, reliable enterprise solution. On the face of it, adding an LLM service to your application is just another API integration, but their behaviour brings with it a new set of engineering concerns that can cause problems if not understood fully.</p>
<p>This post explains what those things are, and why they should be treated the same as any other engineering quality concern, so that you can build AI-integrated solutions with confidence.</p>
<h2 id="a-timeline-of-engineering-quality">A timeline of engineering quality</h2>
<p>Before we dig into the specifics of LLM models and AI services, it's useful to look back at other technology and software architecture patterns, and understand how we thought about them in terms of ensuring quality. The easiest way for me to do that is to look back at my own career. Clearly this won't be a fully comprehensive guide, but there's enough experience in there to highlight commonalities across technology stacks and ecosystems.</p>
<h3 id="establishing-the-foundations">Establishing the foundations</h3>
<p>When I began working as a software developer in 2001, after a brief dabble with Borland Delphi and ASP pages, I quickly found myself immersed in the world of .NET. This was .NET 1.1 territory and <a href="https://en.wikipedia.org/wiki/ASP.NET_Web_Forms">ASP.NET Web Forms</a> (around which there were a lot of strong opinions!). As a framework that was designed to magically generate HTML web pages using a set of server-side components, it was inherently difficult to pull apart business logic from the user interface layer, which made it hard to test.</p>
<p>But what we were building was complicated (<a href="https://en.wikipedia.org/wiki/Transport_Direct_Portal">a government-funded, UK-wide, multi-modal journey planner, pre-Google Maps</a>) and we needed to prove that it was working correctly. The "auto-magic" that the framework provided allowed for rapid development of things that were simple, but started to get in the way and make things harder as the logic and interactions became more complex. I learned about unit testing as these tools started to become available for .NET (NUnit, MbUnit and then MSTest), and how to refactor code to be able to validate the things we needed to. Having confidence that things were working as they should shifted the dial from slow and brittle feedback loops to rapid, reliable validation cycles.</p>
<p>By the time the ALT.NET movement gained traction towards the end of the 2000's, I was a full-blown <a href="https://en.wikipedia.org/wiki/Test-driven_development">TDD</a> aficionado. Shifting the focus to test-first made for better system design, and along with it came more advanced techniques and tooling like mocking, inversion-of-control containers and continuous integration processes to automate quality gates.</p>
<p>Around this time, I was also lucky enough to attend one of JP Boodhoo's .NET engineering bootcamps in Vancouver, which embedded the value of executable specifications with <a href="https://en.wikipedia.org/wiki/Behavior-driven_development">Behaviour Driven Development (BDD)</a>, highlighting the importance of natural language and encouraging closer collaboration between business and technical teams.</p>
<p>These core principles have been the foundation underpinning all software development I've been involved in since, enabling high-quality delivery of well-documented code in iterative development cycles. They've been applied across web and native application stacks and across a variety of architecture patterns. But whilst there have been specific implementations and frameworks that became the flavour of the month according to the language or toolset of the moment, it's the concepts and approach that have always been what matters.</p>
<h3 id="contract-first-observability-and-resilience-in-the-cloud">Contract-first, observability and resilience in the cloud</h3>
<p>Fast forward to the 2010s and the world had moved on to API-first, REST-based architectures. At this point I was leading a small team responsible for building the payment processing engine for a large Middle-Eastern airline. With these APIs any mistakes could literally cost money, so as well as ensuring that we had comprehensive test coverage, we also focused on instrumentation and observability to help us diagnose things in our production environment. This was how I learned (the hard way!) that <a href="https://en.wikipedia.org/wiki/ISO_4217#Minor_unit_fractions">not all currencies have 2 decimal places</a>! And more generally that if you're building an API, you also need a way to execute it. This was pre-Postman, and pre-Swagger, so the only thing to do was build our own version of an API client - a test harness that could be used to execute and validate the various endpoints.</p>
<p>In parallel to this, we were moving everything into the cloud, and started to encounter a new set of quality challenges. Distributed systems brought transient failures, eventual consistency, and the need for sophisticated retry logic.</p>
<p>This experience taught me that integration testing isn't just about verifying that your code works, it's about understanding the contract between systems, the innate behavioural patterns and capabilities of those systems, and building evaluation harnesses that can systematically validate behaviour across different scenarios.</p>
<h3 id="data-machine-learning-and-the-challenge-of-uncertainty">Data, machine learning, and the challenge of uncertainty</h3>
<p>By the time we get to 2015, the cloud had also allowed us to capture a lot more data. And so the landscape shifted again with an increasing focus on data engineering - machine learning, data science, cloud data platforms and advanced analytics. The data space was, and in many ways still is, less mature when it comes to engineering practices. But whilst the challenges were different, I found myself still applying the same "there's always a way to test something" mentality.</p>
<p>With machine learning and data science, the tendency to draw conclusions from patterns in data that are really just random noise could be balanced with <a href="https://endjin.com/blog/machine-learning-the-process-is-the-science">structured experimentation, upfront definition of success metrics and rigorous validation</a>.</p>
<p>Cloud data pipelines that were asynchronous and long-running meant you couldn't just run a quick unit test and get instant feedback. So we needed new approaches - schema validation to <a href="https://endjin.com/blog/creating-quality-gates-in-the-medallion-architecture-with-pandera">catch structural changes and data quality issues early</a>, <a href="https://endjin.com/what-we-think/talks/fake-it-til-you-make-it-generating-production-quality-test-data-at-scale">synthetic data generation</a> to test edge cases that might not appear in production for months and data snapshot testing to validate consistency of pipeline outputs over time. We developed approaches for testing data quality at scale, validating not just that pipelines completed successfully, but that the data they produced was fit for purpose.</p>
<p>The key insight from this is that uncertainty doesn't mean untestable, it just means we need different validation strategies. This sometimes meant shifting from testing specific values to testing behaviours and patterns. My <a href="https://endjin.com/what-we-think/talks/how-to-ensure-quality-and-avoid-inaccuracies-in-your-data-insights">talk at SQL Bits in 2024 - "Do those numbers look right?"</a> summarises how we were thinking about engineering quality in our data solutions, and deep dives into practical approaches for testing Power BI reports, data pipelines and Spark and Python interactive notebooks.</p>
<h2 id="so-are-llms-really-any-different">So are LLMs really any different?</h2>
<p>Yes and no. On the one hand, they're just another integration - either via an API, or through local model deployment. But on the other hand, they exhibit characteristics that require us to think differently about validation.</p>
<p><strong>They're non-deterministic</strong>: Unlike traditional APIs where the same input always produces the same output, LLMs can generate different responses each time due to their probabilistic nature (even when you set the temperature to 0). This makes traditional unit testing approaches which rely on exact output matching ineffective.</p>
<p><strong>They can hallucinate</strong>: LLMs can confidently generate plausible-sounding but false information. Unlike a database query that either returns valid data or throws an error, an LLM might return a well-constructed response that is actually wrong - syntactically correct, but semantically and factually incorrect. This requires us to validate not just the structure of responses, but their factual accuracy and relevance.</p>
<p><strong>They produce qualitative, unstructured responses</strong>: Traditional software returns structured data (JSON objects, numerical values, boolean flags etc.). LLMs return natural language, which is inherently ambiguous and context-dependent. Despite advancements in the area of structured outputs, this still isn't 100% reliable. And how do you write an assertion that validates "the response should be friendly and helpful"?</p>
<p>However, none of these challenges are entirely new. Non-determinism shows up in async operations, race conditions, rate limiting, or time-dependent behaviour. User experience validation can be qualitative in nature. And we've had to deal with integration points that might return unexpected results.</p>
<blockquote>
<p>On that basis we can, and should, apply the same core engineering principles we've always used, albeit adapted to the unique characteristics of LLMs.</p>
</blockquote>
<h2 id="break-open-the-black-box">1. Break open the black box</h2>
<p>Just as ASP.NET Web Forms made it hard to test by tightly coupling UI and logic, LLM integrations can become black boxes if we treat them as magic, closed systems. The solution is the same - refactor to separate concerns. This means treating your LLM interaction as a discrete component with clear inputs and outputs. Which are the configurable bits that you have control over (e.g. the model version, the prompt, the temperature etc.), and which bits are "inside the box" (e.g. the inner workings of the model, system prompts etc.)?</p>
<p>For example, don't embed prompts directly in your application code. Instead, create a prompt management layer that allows you to version, test, and iterate on prompts independently of your application logic.</p>
<p>At endjin, when we build LLM-powered solutions, we structure them so that the prompt construction, model invocation, and response parsing are separate, testable components. This helps to unlock the ability to swap models, adjust parameters, or refine prompts without touching core business logic.</p>
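<p>As a rough illustration of that separation (the names here - <code>PromptTemplate</code>, <code>generate_reply</code> and the injected <code>complete</code> callable - are hypothetical rather than taken from any particular SDK), a prompt management layer might look something like this:</p>
<pre><code class="language-python">from dataclasses import dataclass
from typing import Callable

# Hypothetical prompt management layer: prompts are versioned data,
# not string literals buried in application code.
@dataclass(frozen=True)
class PromptTemplate:
    name: str
    version: str
    template: str

    def render(self, **values):
        return self.template.format(**values)

SUMMARISE_COMPLAINT = PromptTemplate(
    name="summarise-complaint",
    version="2.0",
    template="Summarise the customer's core complaint in one sentence:\n\n{comment}",
)

def generate_reply(comment: str, complete: Callable[[str], str]):
    """Model invocation sits behind an injected 'complete' callable, so prompts,
    parameters and models can be swapped without touching the calling code."""
    prompt = SUMMARISE_COMPLAINT.render(comment=comment)
    return complete(prompt).strip()

# In tests, 'complete' can be a stub; in production it wraps your chosen LLM API.
print(generate_reply("My delivery was 3 days late!", complete=lambda p: "Delivery delay."))
</code></pre>
<p>Because the prompt is just data, it can be versioned, diffed and evaluated independently of the application logic around it.</p>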
<h2 id="embrace-natural-language-as-a-feature">2. Embrace natural language as a feature</h2>
<p>One of the biggest insights from Behaviour Driven Development was that natural language specifications bridge the gap between business stakeholders and technical teams. LLMs flip this on its head - natural language isn't just the specification, it's also the programming interface.</p>
<p>This can be seen as an advantage. You can write evaluation criteria in plain English: "<em>The response should identify the customer's primary concern</em>", "<em>The summary should be under 100 words</em>", "<em>The sentiment should be appropriate to the context</em>". Then you can use LLMs themselves to evaluate these criteria. Don't forget that the LLM can also be instructed to return additional numerical or categorical information that can augment the natural language response (for example a confidence level between 0 and 1), which can enable more traditional testing to still be performed.</p>
<p>Taking it a step further, part of the specification could be a feedback loop to improve the specification (akin to getting someone else to review your work) before you execute the steps. This might mean LLMs all the way down, but in a good way.</p>
<p>The key is to be systematic about it. There are possible weaknesses in this approach when you consider the vagaries of language, but creating a feedback loop to explore the context with an LLM can be very powerful. Define your success criteria upfront, create evaluation prompts that assess those criteria, and validate them against labelled examples before you rely on them in production.</p>
<h2 id="test-first-even-for-prompts">3. Test-first, even for prompts</h2>
<p>The discipline of Test-Driven Development teaches us to think about desired outcomes before implementation. This is even more important with LLMs, where it's easy to iterate endlessly on prompts without a clear definition of success.</p>
<p>Start by defining your test cases - specific inputs with expected behaviours. Not exact outputs (remember, we're dealing with non-deterministic responses) but behavioural expectations. For a customer service chatbot, you might want to identify what a complaint is about and make sure the right resolution is offered.</p>
<p>For example:</p>
<pre><code class="language-gherkin">Given the customer comment is "Despite paying extra for speedy postage, the promised delivery date was missed by 3 days!"
When the chatbot generates a customer service response
Then the response should identify the core issue as 'delivery delay'
And the tone should be 'apologetic'
And the resolution offered should be 'refund of premium postage'
</code></pre>
<p>Then iterate on your prompt design until your system reliably meets these criteria. Track your success rate over time. If you're getting 85% success on your test suite, that's a quantifiable baseline you can work to improve. This is infinitely better than the "it seems to work pretty well" approach.</p>
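<p>A minimal sketch of what that tracking can look like - the <code>generate</code> and <code>evaluate</code> callables are assumed to exist, and the latter could itself be LLM-based as discussed above:</p>
<pre><code class="language-python">from typing import Callable

# Illustrative labelled test cases: inputs plus expected behaviours, not exact outputs.
TEST_CASES = [
    {"comment": "Despite paying extra for speedy postage, delivery was 3 days late!",
     "expected": {"issue": "delivery delay", "tone": "apologetic"}},
    {"comment": "The app crashes every time I open my basket.",
     "expected": {"issue": "app defect", "tone": "apologetic"}},
]

def success_rate(generate: Callable[[str], str], evaluate: Callable[[str, dict], bool]):
    """Run every labelled case and report the pass rate as a quantifiable baseline."""
    passed = sum(
        1 for case in TEST_CASES
        if evaluate(generate(case["comment"]), case["expected"])
    )
    return passed / len(TEST_CASES)

# A result of 0.85 is an 85% baseline you can track and improve over time.
</code></pre>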
<h2 id="build-or-use-evaluation-harnesses">4. Build or use evaluation harnesses</h2>
<p>Just as we built and used custom API clients to test our APIs, we need evaluation harnesses for LLM integrations. The good news here is that lots of <a href="https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/develop/cloud-evaluation?view=foundry&amp;tabs=python">AI developer services have eval frameworks built in</a>. These aren't just test scripts, they're tools that allow you to systematically evaluate performance across multiple dimensions.</p>
<p>Your evaluation harness should:</p>
<ul>
<li>Run your prompts against a diverse set of test inputs</li>
<li>Capture and version the outputs for comparison</li>
<li>Apply multiple evaluation metrics (accuracy, relevance, tone, safety)</li>
<li>Track performance over time as you refine prompts, change models or a <a href="https://www.anthropic.com/engineering/a-postmortem-of-three-recent-issues">model version is incremented</a></li>
<li>Handle transient failures and rate limits gracefully</li>
</ul>
<p>There are comparisons here with machine learning models that are trained and used in data science experiments. Whilst you can't do exact input/output testing, you can test acceptable tolerance ranges, error rates and so on.</p>
<h2 id="you-need-labelled-datasets">5. You need labelled datasets</h2>
<p>Just as you can't train a machine learning model without labelled data, you can't validate an LLM integration without example inputs and expected behaviours. Even a small number of well-chosen examples can provide meaningful validation. But also focus on edge cases and failure modes. What happens when the input is ambiguous? When it's in a different language? When it contains unusual formatting or special characters?</p>
<p>As your system matures, invest in building a larger, more diverse evaluation dataset. Involve domain experts who can set the acceptance criteria for the system. This becomes your regression test suite, allowing you to confidently refine prompts or change models while ensuring you haven't broken existing functionality. Given the volume of tests that will be required, don't underestimate the level of effort required to curate them. This contributes to the TCO of the solution - production-grade LLM solutions are expensive endeavours.</p>
<h2 id="embrace-transient-failures-and-build-resilience">6. Embrace transient failures and build resilience</h2>
<p>LLM APIs, like any external service, can experience transient failures, rate limits, performance degradation or varying response times. Your integration needs to handle these gracefully.</p>
<p>Implement retry logic with exponential backoff. Cache responses where appropriate. Monitor latency and error rates. Build fallback strategies for when the LLM service is unavailable. These are the same patterns we use for any cloud API integration, just applied to a different integration point.</p>
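<p>As a sketch of the retry side of this (in practice you'd probably reach for an established library such as tenacity, or the retry options built into your LLM SDK):</p>
<pre><code class="language-python">import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=1.0):
    """Retry a transient-failure-prone call with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:  # in real code, catch only the transient error types you expect
            if attempt == max_attempts:
                raise  # give up and surface the error to the caller
            # 1s, 2s, 4s, ... plus jitter so concurrent callers don't retry in lockstep
            time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5))
</code></pre>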
<h2 id="test-behaviours-not-values">7. Test behaviours, not values</h2>
<p>Finally, apply the same patterns for validating data quality - focus on behaviours and patterns rather than specific values. Instead of asserting that a generated email contains the exact phrase "<em>Thank you for your inquiry</em>", check that it:</p>
<ul>
<li>Addresses the customer's stated concern</li>
<li>Maintains an appropriate tone</li>
<li>Includes relevant next steps</li>
<li>Doesn't contain factual inaccuracies</li>
</ul>
<p>This is more robust to the natural variation in LLM outputs, while still catching the failure modes that matter. It's often beneficial to use a separate LLM model or service to do this validation so that you avoid the innate bias used in generating the original output being used to validate it (i.e. don't ask an LLM to mark its own work). This technique is referred to as LLM-as-a-judge.</p>
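<p>A hedged sketch of what an LLM-as-a-judge check might look like - the judge prompt, criteria names and JSON shape are illustrative assumptions, and <code>judge</code> is a completion function backed by a different model from the one that generated the reply:</p>
<pre><code class="language-python">import json
from typing import Callable

JUDGE_PROMPT = """Assess the reply below against each criterion and answer with JSON only,
for example {{"addresses_concern": true, "appropriate_tone": true,
"includes_next_steps": true, "factually_consistent": true}}.

Customer concern: {concern}
Reply under review: {reply}
"""

def judge_reply(concern: str, reply: str, judge: Callable[[str], str]):
    """Score behaviours with a separate 'judge' model rather than asserting on exact wording."""
    return json.loads(judge(JUDGE_PROMPT.format(concern=concern, reply=reply)))

def assert_behaviours(verdict: dict):
    failed = [name for name, ok in verdict.items() if not ok]
    assert not failed, f"Behavioural checks failed: {failed}"
</code></pre>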
<h2 id="instrument-and-observe">8. Instrument and observe</h2>
<p>Across all of these approaches, there's a common thread - you need visibility into what's happening. As more work is handed to the LLM, this accentuates the need for human oversight, supervision, quality checks, curation of more examples, responding proactively to new situations (e.g. a new type of enquiry driven by external factors). Instrument your LLM integrations thoroughly. Log inputs, outputs, and evaluation metrics. Track latency, error rates, and cost. Monitor for drift in model behaviour over time.</p>
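<p>Even a thin wrapper goes a long way here. A minimal sketch, where the field names and logging sink are illustrative:</p>
<pre><code class="language-python">import json
import logging
import time

logger = logging.getLogger("llm_calls")

def instrumented_call(prompt, complete):
    """Record inputs, outputs and latency for every model call so that drift,
    errors and cost can be analysed after the fact."""
    started = time.perf_counter()
    response = complete(prompt)
    logger.info(json.dumps({
        "prompt_chars": len(prompt),
        "response_chars": len(response),
        "latency_seconds": round(time.perf_counter() - started, 3),
    }))
    return response
</code></pre>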
<p>At endjin, we treat observability as a first-class concern for LLM integrations, just as we do for any other production system.</p>
<h2 id="summary-from-black-box-to-engineering-discipline">Summary: From black box to engineering discipline</h2>
<p>The journey from prompt engineering to AI programming is really a journey from treating LLMs as magic to treating them as engineered components. Yes, they have unique characteristics (non-determinism, hallucinations, qualitative outputs), but these aren't deal-breakers, they're just new constraints that require adapted approaches.</p>
<p>To succeed in deploying reliable, enterprise-grade AI solutions, treat LLM integrations with the same engineering standards that you apply to any other system component. That means systematic evaluation, defensive engineering, instrumentation, and continuous improvement.</p>
<p>At endjin, we've seen this transformation happen with our clients. The shift from "let's see if this prompt works" to "here's our evaluation framework and current performance metrics" represents a fundamental change in how teams think about AI integration. And it's that shift that enables moving from demos to production-ready solutions with confidence.</p>
<p>In subsequent posts, I'll dive deeper into the practical tools and frameworks that make this systematic evaluation possible.</p>]]></content:encoded>
    </item>
    <item>
      <title>Scaling API Ingestion with the Queue-of-Work Pattern</title>
      <description>The queue-of-work pattern enables massive parallelism for API ingestion by breaking large jobs into thousands of independent work items processed by concurrent workers. This approach reduced data ingestion time for our use case from 15 hours to under 2 hours while providing automatic retry handling and fault tolerance at a fraction of the cost of traditional orchestration tools.</description>
      <link>https://endjin.com/blog/scaling-api-ingestion-with-the-queue-of-work-pattern</link>
      <guid isPermaLink="true">https://endjin.com/blog/scaling-api-ingestion-with-the-queue-of-work-pattern</guid>
      <pubDate>Fri, 06 Mar 2026 06:30:00 GMT</pubDate>
      <category>python</category>
      <category>engineering</category>
      <category>data engineering</category>
      <category>pyspark</category>
      <category>synapse</category>
      <category>notebooks</category>
      <category>azure container apps</category>
      <category>azure synapse analytics</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/02/scaling-api-ingestion-with-the-queue-of-work-pattern.png" />
      <dc:creator>Jonathan George</dc:creator>
      <content:encoded><![CDATA[<p>TL;DR; The queue-of-work pattern enables massive parallelism for HTTP based API ingestion by breaking large jobs into thousands of independent work items processed by concurrent workers. This approach reduced our data ingestion time from 15 hours to under 2 hours while providing automatic retry handling and fault tolerance at a fraction of the cost of traditional orchestration tools.</p>
<p>Sample code to go with this blog post <a href="https://github.com/endjin/python-queue-of-work-pattern-demo">can be found on GitHub</a>.</p>
<h2 id="the-problem-when-sequential-api-calls-take-days">The Problem: When Sequential API Calls Take Days</h2>
<p>It's common in data platforms to need to acquire data from HTTP-based APIs. If written well, these APIs will be reliable, fast and will allow you to control filtering, windowing, and pagination. But even if all of this is true, ingesting large amounts of data via an API can be challenging.</p>
<p>We recently faced one of these challenges when building out a new workload on an existing Azure Synapse-based modern data platform: the initial load of data from a source system required us to ingest about 2,000,000 records and the only way of doing this was via an HTTP API. A full synchronization requires tens of thousands of individual API requests, each fetching a page of 100 records of data.</p>
<p>We started by estimating how long this would take with a simple sequential approach. The API averages about 500ms per request. For 20,000 API calls:</p>
<pre><code class="language-text">20,000 requests × 0.5 seconds = 10,000 seconds = 2.8 hours
</code></pre>
<p>We also need to consider the volumes of data being processed. For us, the payload size is generally from 200KB to 1.5MB, with an average of 1MB. So 20,000 requests yield around 20 GB of data.</p>
<p>That seems manageable, right? But this calculation ignores several real-world factors:</p>
<ul>
<li><strong>Network variability</strong>: Some requests take 2-3 seconds, especially during peak hours</li>
<li><strong>Failures and retries</strong>: Network issues and transient API errors require retry logic</li>
<li><strong>Data processing</strong>: Each response needs parsing, transformation, and storage</li>
</ul>
<p>In reality, a test retrieval of a subset of records suggested we were looking at something in the region of 48 hours to do a full ingestion, accounting for failures and retries.</p>
<p>That's too long. Our requirements were clear:</p>
<ul>
<li><strong>Scalable</strong>: Process massive volumes efficiently through parallelism</li>
<li><strong>Fault-tolerant</strong>: Handle API failures gracefully without losing progress</li>
<li><strong>Fast</strong>: Complete full ingestion in under 12 hours</li>
</ul>
<h2 id="why-not-use-synapse-pipelines">Why Not Use Synapse Pipelines?</h2>
<p>The customer in this scenario already had a well established Azure Synapse platform in place. As a result the first port of call for data ingestion is normally a Synapse pipeline. After all, Synapse is designed for data orchestration, has many built in connectors and is already fully integrated with our data lake. However, we quickly discovered several dealbreakers for this specific use case:</p>
<h3 id="cost-inefficiency">Cost Inefficiency</h3>
<p>Synapse pipelines charge per activity execution (approximately £0.00085 per activity run) and per integration runtime hour. Let's break down the cost for our 20,000 API calls:</p>
<pre><code class="language-text">20,000 activities × £0.00085 = £17 per full ingestion
Integration Runtime: ~£0.18/hour × 48 hours = £8.64
Total: ~£26 per full sync
</code></pre>
<p>While £26 doesn't sound expensive, issues outside our control mean it's likely we'd need to do this several times a year. We'll also be running nightly incremental updates; although these process way smaller amounts of data, there will occasionally be bulk updates in the source system that require larger data volumes to be reingested.</p>
<p>Over a year, this all adds up. But more importantly, Synapse's orchestration overhead makes it unsuitable for high volumes of small operations regardless of cost.</p>
<h3 id="limited-resilience">Limited Resilience</h3>
<p>Synapse pipelines support retry logic, but their error handling is coarse-grained. If one API call in a batch of 1,000 fails, the retry mechanism repeats the entire batch, not just the failed item. There's no built-in concept of a "poison message queue" for persistently failing records.</p>
<p>When we tested this with a deliberately flaky API endpoint, a single bad record caused the same batch to retry indefinitely, blocking progress on thousands of good records. This simply won't work for production ingestion where some records may have data quality issues.</p>
<h3 id="orchestration-overhead">Orchestration Overhead</h3>
<p>Synapse pipelines introduce latency between activity transitions. In our testing, even with an empty pipeline activity, there's typically 3-5 seconds of overhead per activity execution. For 20,000 activities:</p>
<pre><code class="language-text">20,000 activities × 4 seconds overhead = 80,000 seconds = 22 hours
</code></pre>
<p>This overhead alone exceeds our entire time budget. The pipeline execution model is optimized for long-running data transformations, not for coordinating thousands of quick API calls.</p>
<h3 id="lack-of-dynamic-scaling">Lack of Dynamic Scaling</h3>
<p>Scaling Synapse pipeline execution requires manual configuration of integration runtime settings. You can't easily spin up hundreds of concurrent workers dynamically based on queue depth, then scale down to zero when work completes. This inflexibility makes it difficult to optimize both cost and performance.</p>
<h2 id="the-queue-of-work-pattern-a-better-approach">The Queue-of-Work Pattern: A Better Approach</h2>
<p>After ruling out Synapse pipelines, we needed a solution that could handle parallelism without orchestration overhead. The queue-of-work pattern emerged as the ideal approach, and it's surprisingly simple in concept: break the large job into thousands of small, independent work items, put them in a queue, and let multiple workers process them concurrently.</p>
<p>The pattern decouples work distribution from work execution, which turns out to be the key to solving all our challenges. Here's how it works:</p>
<h3 id="pattern-overview">Pattern Overview</h3>
<pre><code class="language-text">┌─────────────────┐
│   Ingestor      │  1. Breaks large job into small work items
│   (Enqueuer)    │  2. Enqueues items to Azure Storage Queue
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Azure Storage  │  3. Durable, distributed queue
│     Queue       │  4. Supports automatic retry &amp; poison handling
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     Queue       │  5. Dequeues messages
│   Processor     │  6. Dispatches to work item processors
│   (Workers)     │  7. Processes API calls
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Data Lake     │  8. Writes results
│   Storage       │
└─────────────────┘
</code></pre>
<h3 id="key-components">Key Components</h3>
<ol>
<li><p><strong>Work Items</strong>: Strongly-typed dataclasses that represent individual units of work (e.g., "fetch assets with IDs 1000-5999")</p>
</li>
<li><p><strong>Work Queue</strong>: An abstraction over Azure Storage Queues that handles serialization, dequeuing, and poison message management. Using an abstraction makes testing simpler.</p>
</li>
<li><p><strong>Work Item Processors</strong>: Decorated functions that execute the actual work (API calls, data transformation, persistence)</p>
</li>
<li><p><strong>Queue Processor</strong>: A loop that dequeues messages, dispatches them to the appropriate processor, and handles errors</p>
</li>
<li><p><strong>Work Item Dispatcher</strong>: A registry that maps work item types to their processors using decorator-based registration (a minimal sketch follows this list)</p>
</li>
</ol>
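<p>To make the decorator-based registration concrete, here's a minimal sketch of how a dispatcher along these lines could be implemented. The names mirror those used in the snippets later in this post, but this is an illustrative reconstruction rather than the production code.</p>
<pre><code class="language-python">from dataclasses import dataclass
from typing import Callable, Dict, Type

@dataclass
class WorkItem:
    """Base class for all work item types; concrete items add their own fields."""
    snapshot_time: str
    correlation_id: str

class WorkItemDispatcher:
    """Registry mapping each work item type to the function that processes it."""
    _processors: Dict[Type[WorkItem], Callable] = {}

    @classmethod
    def register(cls, work_item_type, processor):
        cls._processors[work_item_type] = processor

    @classmethod
    def dispatch_work_item(cls, work_item, logger, **kwargs):
        processor = cls._processors[type(work_item)]
        processor(work_item, logger, **kwargs)

def work_item_processor(work_item_type):
    """Decorator that registers the decorated function as the processor for a work item type."""
    def decorator(func):
        WorkItemDispatcher.register(work_item_type, func)
        return func
    return decorator
</code></pre>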
<h2 id="how-the-pattern-solves-our-challenges">How the Pattern Solves Our Challenges</h2>
<h3 id="scalability-through-parallelism">1. Scalability Through Parallelism</h3>
<p>The first benefit of this pattern became apparent when we broke down the ingestion workload. Instead of processing 1 million assets sequentially, we could divide them into manageable chunks:</p>
<pre><code class="language-python"># Breaking down 1 million assets into 200 independent work items
max_asset_id = 1_000_000
batch_size = 5_000

for start_id in range(0, max_asset_id, batch_size):
    work_queue.enqueue(
        IngestItemsByIdWorkItem(
            snapshot_time=timestamp,
            correlation_id=correlation_id,
            from_id=start_id,
            to_id=start_id + batch_size - 1
        )
    )
</code></pre>
<p>This creates 200 independent work items, each responsible for 5,000 assets. Why 5,000? We tested different batch sizes and found this offered the best balance - large enough to amortize queue overhead, small enough that a single failure doesn't waste too much work.</p>
<p>Now here's where it gets interesting. With these work items in the queue, we can run multiple queue processors simultaneously:</p>
<ul>
<li>Each processor dequeues and processes messages independently</li>
<li>No coordination overhead between workers - they don't even know about each other</li>
<li>Azure Storage Queues handle concurrent access automatically with optimistic concurrency</li>
<li>Scale up the number of workers simply by deploying more instances</li>
</ul>
<p>In our production environment, we run this as Azure Container Apps Jobs. We did some testing to find out what level of concurrency can be supported by the target API - sending too many concurrent requests would overload the API and cause failures to spike. As a result of this, during a full ingestion we typically scale to 12 concurrent workers. The math works out well:</p>
<pre><code class="language-text">20,000 API calls ÷ 12 workers = 1,666 API calls per worker
1,666 calls × 0.5 seconds average = 833 seconds = 14 minutes
</code></pre>
<p>In practice, we see full ingestion complete in under 2 hours due to API variability and processing overhead, but this is still a massive improvement over the roughly 48-hour sequential approach.</p>
<h3 id="fault-tolerance-through-retry-and-poison-queues">2. Fault Tolerance Through Retry and Poison Queues</h3>
<p>One of the most challenging aspects of large-scale API ingestion is handling failures gracefully. APIs are unreliable - they timeout, return 500 errors, get rate-limited, or occasionally return malformed data. We needed the system to handle these failures without manual intervention.</p>
<p>Azure Storage Queues provide built-in retry semantics through visibility timeouts and dequeue counts. Here's how we leveraged them:</p>
<pre><code class="language-python">class AzureStorageWorkQueue:
    def __init__(self, queue, poison_queue, max_dequeues=5):
        self._queue = queue
        self._poison_queue = poison_queue
        self._max_dequeues = max_dequeues

    def dequeue(self):
        # Visibility timeout of 300 seconds (5 minutes)
        message = self._queue.receive_message(visibility_timeout=300)

        if message is None:
            # The queue is empty
            return None

        # Automatic poison message handling
        if message.dequeue_count &gt; self._max_dequeues:
            # This message has failed 5+ times, move it to poison queue
            self._poison_queue.send_message(message.content)
            self._queue.delete_message(message)
            return None  # Skip this problematic message

        return DequeuedWorkItem(
            work_item=jsonpickle.decode(message.content),
            dequeued_message=message
        )
</code></pre>
<p>Note - to keep the code simpler here, I haven't shown any additional error handling logic around the calls to the storage API. Depending on the exact logic and control mechanisms you have in place, you should consider adding error handling and retry logic around API calls like <code>queue.receive_message</code>.</p>
<p><strong>How this works in practice:</strong></p>
<p>When a worker dequeues a message, it becomes invisible to other workers for 5 minutes. If the worker successfully processes it, the message is deleted. If the worker crashes or the API call fails, the message automatically becomes visible again after 5 minutes for another worker to retry.</p>
<p>Azure Storage Queues track how many times each message has been dequeued. After 5 failed attempts, we consider the message "poisoned". This is a standard term in message processing systems that refers to a message which has repeatedly failed to be dispatched or processed, likely due to bad data or a systematic issue. We don't want to waste resources by continually trying to process the message, but we don't want to lose it either so we move it to a separate "poison queue" for manual investigation.</p>
<p>It should be noted here that even if you've chosen a finite TTL for your message queue, you should ensure that messages on the poison queue never expire by setting their TTL to -1.</p>
<p>This approach has proven remarkably resilient, and it's unusual to see runs end with messages on the poison queue. We've added a check at the end of processing to raise an alert if this does happen so that the bad messages can be investigated. Options also exist inside Azure to monitor message queues for other scenarios - e.g. an excessive number of messages in either queue - and raise alerts.</p>
<p>This approach also aids recovery in the event of a catastrophic failure. For example, during one ingestion run, the source API had a 6-hour outage. This meant that the majority of the queue messages ended up on the poison queue. Once we realised what had happened, we were able to quickly move all of the messages from the poison queue back to the processing queue and rerun them.</p>
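<p>As a rough sketch of how that kind of re-queuing could be done with the <code>azure-storage-queue</code> SDK (the queue names and connection string here are placeholders):</p>
<pre><code class="language-python">from azure.storage.queue import QueueClient

conn_str = "&lt;storage-connection-string&gt;"
poison = QueueClient.from_connection_string(conn_str, "ingest-work-items-poison")
main = QueueClient.from_connection_string(conn_str, "ingest-work-items")

# Drain the poison queue, re-enqueueing each message for normal processing.
for message in poison.receive_messages(visibility_timeout=60):
    main.send_message(message.content)   # the new message starts with a fresh dequeue count
    poison.delete_message(message)       # remove the original from the poison queue
</code></pre>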
<h3 id="speed-through-asynchronous-processing">3. Speed Through Asynchronous Processing</h3>
<p>The queue processor implementation is deliberately simple, which turns out to be a performance advantage:</p>
<pre><code class="language-python">def process_queue(correlation_id, work_queue, logger, ...):
    while not queue_empty:
        message = work_queue.dequeue()
        if message is None:
            queue_empty = True
            continue

        try:
            # Dispatch to appropriate processor
            WorkItemDispatcher.dispatch_work_item(
                message.work_item,
                logger,
                **kwargs
            )
            # Success: remove from queue
            work_queue.remove(message)
        except Exception as e:
            # Failure: the message is not deleted, so it becomes visible
            # again for retry once its visibility timeout expires
            logger.error(f"Error processing: {e}")
</code></pre>
<p>This simple loop runs on each worker container. There's no complex orchestration, no distributed locking, no coordination between workers. Each worker independently:</p>
<ol>
<li>Dequeues a message (blocking call, returns when message available)</li>
<li>Processes it</li>
<li>Deletes the message on success</li>
<li>Repeats until the queue is empty</li>
</ol>
<p>We measured the queue overhead itself - dequeue plus delete operations - at consistently under 50ms. This means for a typical 500ms API call, only 10% of the time is queue overhead. For slower API calls (2-3 seconds), the overhead becomes negligible.</p>
<p>The pattern also enables progressive completion. Unlike batch systems where you wait hours to see any results, data appears in the data lake as soon as each work item completes. This was particularly valuable during testing - we could verify the data pipeline was working correctly within minutes rather than waiting for a full ingestion to complete.</p>
<h3 id="clean-separation-of-concerns">4. Clean Separation of Concerns</h3>
<p>The dispatcher pattern keeps code maintainable:</p>
<pre><code class="language-python"># Define a work item type
@dataclass
class IngestItemsByIdWorkItem(WorkItem):
    from_id: int
    to_id: int

# Register its processor with a decorator
@work_item_processor(IngestItemsByIdWorkItem)
def process_items_by_id(work_item, logger, **kwargs):
    api_client = kwargs["api_client"]
    writer = kwargs["writer"]

    # Fetch data from API
    response = api_client.get_items(
        build_query_params(work_item.from_id, work_item.to_id)
    )

    # Persist to data lake
    writer.persist_assets(work_item.snapshot_time, response)
</code></pre>
<p>Benefits:</p>
<ul>
<li>Each processor focuses on one specific task</li>
<li>Easy to add new work item types without modifying existing code</li>
<li>Type safety through Python dataclasses</li>
<li>Testable in isolation</li>
<li>Clear dependency injection through kwargs</li>
</ul>
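<p>For completeness, here is one possible sketch of how the decorator-based registration and dispatch could work; the actual <code>WorkItemDispatcher</code> in the accompanying sample code may differ in detail:</p>
<pre><code class="language-python"># A minimal registry keyed by work item type; processors register themselves
# via the decorator and the dispatcher looks them up at runtime
_PROCESSORS = {}

def work_item_processor(work_item_type):
    def register(func):
        _PROCESSORS[work_item_type] = func
        return func
    return register

class WorkItemDispatcher:
    @staticmethod
    def dispatch_work_item(work_item, logger, **kwargs):
        processor = _PROCESSORS.get(type(work_item))
        if processor is None:
            raise ValueError(f"No processor registered for {type(work_item).__name__}")
        processor(work_item, logger, **kwargs)
</code></pre>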
<h2 id="important-considerations">Important Considerations</h2>
<p>Before implementing this pattern, there are several important factors to consider:</p>
<p><strong>Data must be partitionable</strong>: Since all the work is enumerated up front, you need a way of partitioning your data into a large number of small chunks. For our full ingestion we chose ID ranges, and for our incremental updates we chose time periods. This will be driven in part by the querying options supported by the API.</p>
<p><strong>Idempotency is critical</strong>: Since messages can be retried automatically, your processors must be idempotent. If a worker crashes after persisting data but before deleting the message, another worker will process the same message again. In our implementation, we write data to the data lake with consistent file paths based on the work item parameters, so re-processing simply overwrites the existing file with identical data.</p>
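<p>As a rough illustration (the real path layout in our implementation may differ), a deterministic path derived solely from the work item parameters guarantees that a retry overwrites the earlier file rather than creating a duplicate:</p>
<pre><code class="language-python">def build_output_path(snapshot_time, from_id, to_id, offset):
    # The same work item parameters always produce the same path, so
    # re-processing a message simply overwrites the previous file with
    # identical data
    return (
        f"bronze/assets/snapshot={snapshot_time:%Y%m%dT%H%M%S}/"
        f"ids_{from_id}_{to_id}_offset_{offset}.json"
    )
</code></pre>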
<p><strong>Message size limits</strong>: Azure Storage Queue messages are limited to 64 KB. Our work items are small (typically &lt;1 KB when serialized), but if you need to pass large payloads, consider storing the data in blob storage and passing a reference in the message.</p>
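<p>If you do need larger payloads, one common approach (sketched below with hypothetical names; not part of our implementation) is to store the payload in blob storage and enqueue only a small reference to it:</p>
<pre><code class="language-python">import json

from azure.storage.blob import BlobClient

def enqueue_large_work_item(work_queue, conn_str, correlation_id, item_id, payload_json):
    # Upload the large payload to blob storage...
    blob = BlobClient.from_connection_string(
        conn_str,
        container_name="work-payloads",
        blob_name=f"{correlation_id}/{item_id}.json",
    )
    blob.upload_blob(payload_json, overwrite=True)

    # ...and enqueue only a small reference that stays well under the 64 KB limit
    work_queue.send_message(json.dumps({"payload_blob": blob.blob_name}))
</code></pre>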
<p><strong>Visibility timeout tuning</strong>: The 5-minute visibility timeout works well for our API calls, but you'll need to adjust this based on your workload. Too short and messages might be retried while still being processed (duplicate work). Too long and failed messages won't be retried quickly enough.</p>
<p><strong>Cost considerations</strong>: While Azure Storage Queue costs are minimal (~£0.00028 per 10,000 operations), container runtime costs can add up. With 12 workers running for 2 hours:</p>
<pre><code class="language-text">12 workers × 2 hours × 3,600 seconds/hour × 0.5 vCPU × £0.0001/vCPU-second
= approximately £4.32 per full ingestion
</code></pre>
<p>This is cheaper than the Synapse cost but provides much better performance and flexibility.</p>
<p><strong>Avoid sensitive data in messages</strong>: Work items should contain only metadata (IDs, timestamps, pagination offsets), not actual sensitive data. The actual data from API responses goes directly to the data lake, not through the queue.</p>
<h2 id="real-world-implementation-example">Real-World Implementation Example</h2>
<p>Let's walk through a complete flow showing how these components work together for ingesting data:</p>
<h3 id="step-1-enqueue-work-main-orchestrator">Step 1: Enqueue Work (Main Orchestrator)</h3>
<p>The first step runs in a dedicated "enqueuer" container that breaks the full ingestion job into work items:</p>
<pre><code class="language-python">class BronzeIngestor:
    def enqueue_full_ingestion(self, snapshot_time, correlation_id):
        # First, query the API to determine the scope of work
        # This makes a single API call to get the maximum asset ID
        max_asset_id = self._get_maximum_asset_id()

        # Break into batches of 5,000 assets each
        batch_size = 5_000
        current_id = 0

        while current_id &lt; max_asset_id:
            self._work_queue.enqueue(
                IngestItemsByIdWorkItem(
                    snapshot_time=snapshot_time,
                    correlation_id=correlation_id,
                    from_id=current_id,
                    to_id=current_id + batch_size - 1
                )
            )
            current_id += batch_size
</code></pre>
<p>This enqueuing process typically completes in under 30 seconds for 20,000 work items. The <code>snapshot_time</code> ensures all workers use the same timestamp for file paths, making the data lake files consistent. The <code>correlation_id</code> ties all work items to the same ingestion job for tracing.</p>
<p>Depending on how many items you end up needing to enqueue, and what else lives in your queue, you will also need to consider error and recovery scenarios here. What if you fail halfway through enqueuing work items? Whilst processing the items themselves needs to be idempotent, we don't want to put multiple copies of the same item on the queue if we can avoid it.</p>
<p>If you have one queue per process, then the simplest option will be to ensure the queue is empty before you start processing, and have a recovery process which clears messages down in case of error.</p>
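<p>With a dedicated queue per process, a minimal sketch of that check might look like the following (assuming the <code>azure-storage-queue</code> <code>QueueClient</code> shown earlier):</p>
<pre><code class="language-python">def ensure_queue_is_empty(work_queue, allow_clear=False):
    # approximate_message_count is only an estimate, but it is good enough to
    # detect leftover messages from a previously failed enqueue run
    properties = work_queue.get_queue_properties()
    if properties.approximate_message_count &gt; 0:
        if allow_clear:
            work_queue.clear_messages()   # recovery: clear stale messages down
        else:
            raise RuntimeError("Work queue is not empty - refusing to enqueue")
</code></pre>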
<h3 id="step-2-process-queue-worker-containers">Step 2: Process Queue (Worker Containers)</h3>
<p>Multiple worker containers run simultaneously, each executing the same code but processing different messages:</p>
<pre><code class="language-python">class BronzeIngestor:
    def process_queue(self, correlation_id):
        # Set up dependencies that processors will need
        processor_kwargs = {
            "api_client": self._api_client,
            "writer": self._bronze_writer
        }

        # This runs until the queue is empty
        # Each worker container runs this independently
        process_queue(
            correlation_id=correlation_id,
            work_queue=self._work_queue,
            logger=self._logger,
            get_processor_kwargs=lambda msg: processor_kwargs
        )
</code></pre>
<p>The <code>process_queue</code> function (shown in the "Speed" section above) is just a simple loop. When the queue is empty, the worker exits gracefully.</p>
<h3 id="step-3-execute-work-registered-processor">Step 3: Execute Work (Registered Processor)</h3>
<p>The dispatcher routes each work item to its registered processor. The <code>@work_item_processor</code> decorator registers this function to handle <code>IngestItemsByIdWorkItem</code> instances:</p>
<pre><code class="language-python">@work_item_processor(IngestItemsByIdWorkItem)
def process_items_by_id(work_item, logger, **kwargs):
    api_client = kwargs["api_client"]
    writer = kwargs["writer"]

    # Each work item handles a range of IDs (e.g., 1000-5999)
    # But the API paginates responses, so we need an inner loop
    offset = 0
    limit = 100  # API returns 100 assets per page

    while True:
        # Make API call for one page of results within this ID range
        response = api_client.get_items(
            build_query_params(
                work_item.from_id,
                work_item.to_id,
                offset,
                limit
            )
        )

        assets = json.loads(response)["assets"]
        if not assets:
            break  # No more results, we're done with this work item

        # Write this page to the data lake
        # File path includes snapshot_time for consistency across workers
        writer.persist_assets(
            work_item.snapshot_time,
            work_item.from_id,
            work_item.to_id,
            offset,
            response
        )

        offset += limit
</code></pre>
<p>For a work item covering IDs 1000-5999 (5,000 assets), this typically makes 50 API calls (100 assets per page). The entire work item takes about 25 seconds to process, which fits comfortably within the 5-minute visibility timeout.</p>
<h3 id="step-4-experimentation-and-tuning">Step 4: Experimentation and tuning</h3>
<p>Once you have a working implementation, some experimentation is needed to validate and tune batch sizes and the degree of parallelism you can support.</p>
<p>When considering batch sizes, there are a variety of factors to take into account. Firstly, how many API requests will be required to ingest a single batch? As mentioned above, our 5,000 asset batches result in around 50 API calls.</p>
<p>If any one of them fails, they will all be retried. If you can't rely on the API to consistently ingest an entire batch of data without failure, you should tune your batch size to minimise the impact of this.</p>
<p>Also, how much data will the API return in a single batch? If you need to process an entire batch of data at once (even if it requires multiple API calls to retrieve), and your payload size is large, will you have enough memory to process it all?</p>
<p>How long will a single batch take to process? This has an effect on your queue visibility timeout; if your batch takes 5 minutes to process but your queue visibility timeout is 2 minutes, this means different instances of your queue processor will likely end up processing the same message.</p>
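<p>A back-of-the-envelope check using the illustrative numbers from this post shows how those pieces relate; the figures here are examples rather than universal guidance:</p>
<pre><code class="language-python">pages_per_batch = 5_000 // 100      # batch size / page size = 50 API calls
worst_case_call_seconds = 3         # pessimistic per-call latency
worst_case_batch_seconds = pages_per_batch * worst_case_call_seconds  # 150s

visibility_timeout_seconds = 300    # 5 minutes

# The visibility timeout should comfortably exceed the worst-case batch time,
# otherwise another worker will pick up a message that is still being processed
assert visibility_timeout_seconds &gt; worst_case_batch_seconds
</code></pre>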
<p>When considering parallelism, this is mainly down to the API you're ingesting from. Depending on your deployment architecture (discussed below) the overall cost for the process may be similar regardless of whether you run 10, 20 or 50 processors in parallel. However, the API you're ingesting from might not be able to cope with this - at worst, you could end up crashing the API, or making it unresponsive for other users. In our scenario, the API we retrieve the data from is the same as that used by the front end application, so we needed to avoid this.</p>
<p>Clearly the two factors are linked. For example, when evaluating the API we are ingesting from, we discovered that:</p>
<ul>
<li>Larger batch sizes are more efficient because the system behind the API caches the query, so retrieving subsequent pages for a batch is relatively fast compared to retrieving the first page.</li>
<li>Smaller page sizes are better, as the API struggled to serialize large pages of data quickly.</li>
<li>The API could reliably cope with running 12 processors in parallel; more than that started to significantly impact production performance for other users. However, we also established that we got better results when running the ingestion process outside working hours.</li>
</ul>
<p>This experimentation process needs to be done carefully. If you are experimenting on a production API, it's likely you'll need to work with the owners to ensure you don't end up rendering their API unusable. If you're working in a sandbox environment, you need to bear in mind that it will likely have a different allocation of compute resources to the production environment, so you may need to include levers to allow you to tune the process once you reach production.</p>
<p>Underpinning this experimentation is ensuring you have baked in observability so that you can evaluate the resulting telemetry to inform your decision - more on this later.</p>
<h2 id="deployment-architecture">Deployment Architecture</h2>
<p>In production, this pattern runs on Azure Container Apps, which provides the perfect hosting model for this workload.</p>
<h3 id="container-jobs">Container Jobs</h3>
<p>Azure Container Apps Jobs are designed for workloads that run to completion then exit - exactly what we need.</p>
<p>We have a single container job which is called with arguments to either run the enqueuing or processing. The job can be started multiple times to support parallelism.</p>
<h3 id="container-size">Container size</h3>
<p>ACA offers a variety of combinations of CPU and memory, which allows you to select an appropriate size for your workload. The exact options depend on whether you're using a Dedicated plan or a Consumption one. We used a Consumption plan, which allows you to size your container from as small as 0.25 cores and 0.5 GB RAM up to 4 cores and 8 GB RAM (as of March 2026).</p>
<p>Since we're not doing a lot of processing, but we are processing reasonably large chunks of data, we chose the 0.5 core and 1 GB RAM option, and it's serving us well. As with most other things, choosing the right container size is a matter of doing some initial calculations and then experimenting until you achieve consistent performance and reliability.</p>
<h2 id="orchestration">Orchestration</h2>
<p>Although we're running the enqueuing and processing logic in Azure Container Apps, we're still orchestrating things via a Synapse pipeline. We chose this route for consistency; we're orchestrating all of the other processes in our data platform via Synapse pipelines and we didn't want to introduce other approaches.</p>
<p>We've created a pipeline that can trigger an Azure Container App Job, passing in the necessary parameters. This can run either as a fire-and-forget process, or it can poll for job completion. We use the former for the full ingestion, and the latter for the incremental ingestion as this allows us to immediately trigger data processing once ingestion is complete.</p>
<h3 id="cost-optimization">Cost Optimization</h3>
<p>Container Apps Jobs only consume resources while running:</p>
<ul>
<li>Enqueue job: Runs for ~1 minute; cost is negligible</li>
<li>Worker jobs: Run for ~2 hours during full ingestion</li>
<li>Zero cost when idle - no minimum running instances required</li>
</ul>
<p>This is significantly more cost-effective than keeping functions warm or maintaining always-on compute resources.</p>
<h3 id="scaling-strategy">Scaling Strategy</h3>
<p>While Azure Container Apps supports automatic queue-based scaling using KEDA, we've found that starting with a fixed number of workers (12) works well for predictable workloads. For unpredictable workloads, especially when the target API can support a higher rate of requests than ours, you could configure scaling rules based on queue depth:</p>
<pre><code class="language-text">Queue depth &gt; 1000 messages → Scale to 50 workers
Queue depth &lt; 100 messages → Scale to 10 workers
Queue empty → Scale to 0
</code></pre>
<h3 id="observability">Observability</h3>
<p>OpenTelemetry tracing provides end-to-end visibility. Each work item creates its own trace span, making it easy to:</p>
<ul>
<li>Identify slow API endpoints</li>
<li>Track which work items failed and why</li>
<li>Identify failure patterns</li>
<li>Measure end-to-end ingestion duration</li>
<li>Correlate all work items for a specific ingestion job using the correlation_id</li>
</ul>
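<p>As a rough sketch of what that looks like in Python (assuming the OpenTelemetry SDK and an exporter are already configured elsewhere in the application), each work item can be wrapped in its own span carrying the correlation ID:</p>
<pre><code class="language-python">from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def dispatch_with_tracing(work_item, logger, **kwargs):
    # One span per work item, tagged with the correlation ID so that all work
    # items for a single ingestion job can be grouped together afterwards
    with tracer.start_as_current_span("process_work_item") as span:
        span.set_attribute("correlation_id", work_item.correlation_id)
        span.set_attribute("work_item_type", type(work_item).__name__)
        WorkItemDispatcher.dispatch_work_item(work_item, logger, **kwargs)
</code></pre>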
<h2 id="benefits-beyond-the-original-requirements">Benefits Beyond the Original Requirements</h2>
<p>While we initially focused on solving the core challenges of scale, fault tolerance, and speed, the pattern has delivered several unexpected benefits:</p>
<p><strong>Incremental ingestion came for free</strong>: Once the framework was in place, adding incremental ingestion was trivial. We created a new work item type (<code>IngestItemsModifiedSinceWorkItem</code>) that queries for recently modified assets instead of ID ranges. The dispatcher, queue processing, and retry logic all work identically.</p>
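<p>A sketch of what that new work item type and its processor might look like, reusing the same base class and decorator as the full ingestion (the query helper here is a hypothetical placeholder):</p>
<pre><code class="language-python">from dataclasses import dataclass
from datetime import datetime

@dataclass
class IngestItemsModifiedSinceWorkItem(WorkItem):
    modified_since: datetime

@work_item_processor(IngestItemsModifiedSinceWorkItem)
def process_items_modified_since(work_item, logger, **kwargs):
    api_client = kwargs["api_client"]
    writer = kwargs["writer"]

    # Query for recently modified assets instead of an ID range; everything
    # else (queueing, retries, poison handling) works exactly as before
    response = api_client.get_items(
        build_modified_since_query_params(work_item.modified_since)
    )
    writer.persist_assets(work_item.snapshot_time, response)
</code></pre>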
<p><strong>Debugging became significantly easier</strong>: When something goes wrong, the poison queue contains the exact work item that failed, complete with all parameters. We can inspect it, fix the underlying issue (often a data quality problem in the source system), then manually re-queue it without reprocessing thousands of successful items. This has saved hours of debugging time.</p>
<p><strong>Testing improved dramatically</strong>: Previously, testing the full ingestion pipeline required running against the actual API or building complex mocks. Now we can test individual processors in isolation by constructing work items with test data. Integration tests can enqueue a few work items and verify the results without processing the entire dataset.</p>
<p><strong>Cost visibility is excellent</strong>: Azure Storage Queue operations cost approximately £0.28 per million operations. For our 20,000-message ingestion, that's less than £0.01 in queue costs. The predictable, pay-per-use model makes capacity planning straightforward.</p>
<p><strong>Ability to modify the structure of the written data</strong>: Data received from our source system arrives as partially formatted JSON. This can cause issues with some processing libraries, such as Polars. We ended up rewriting the retrieved data using the ndjson format, which Polars can process with no issues.</p>
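<p>The rewrite itself is straightforward; a minimal sketch, assuming the API returns a JSON document with an <code>assets</code> array:</p>
<pre><code class="language-python">import json

def to_ndjson(response_text: str) -&gt; str:
    # One JSON object per line (ndjson), which libraries such as Polars can
    # read without tripping over the partially formatted source JSON
    assets = json.loads(response_text)["assets"]
    return "\n".join(json.dumps(asset) for asset in assets)
</code></pre>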
<h2 id="when-to-use-this-pattern">When to Use This Pattern</h2>
<p>The queue-of-work pattern is ideal when:</p>
<ul>
<li>You need to make thousands or millions of API calls</li>
<li>Work can be divided into independent, idempotent units</li>
<li>Fault tolerance is critical (some items may fail, but others should continue)</li>
<li>You want to scale horizontally by adding more workers</li>
<li>Processing time per item varies significantly</li>
<li>You need to prioritize certain types of work</li>
</ul>
<p>It may be overkill for:</p>
<ul>
<li>Small-scale operations (&lt; 100 items)</li>
<li>Tightly coupled sequential processing</li>
<li>Real-time streaming (consider event-driven architectures instead)</li>
<li>Simple ETL where Synapse pipelines or Data Factory are sufficient</li>
</ul>
<h2 id="expansion-to-a-second-data-source">Expansion to a second data source</h2>
<p>We've since extended the pattern to a second data source in the same solution, Amazon Elasticsearch. This required us to go through the same process of evaluation and experimentation again, and brought with it a new consideration.</p>
<p>This time, the main benefit of the pattern is not in error handling, as the Elasticsearch service is highly resilient and less susceptible to transient errors. However, it is highly scalable, which means we can use a much higher degree of parallelism than with our original service.</p>
<p>Finally, the size of individual data items is relatively small, but the volume of items we are retrieving is high, as each item represents a user event in the system. Regardless, we found that retrieving a day's worth of data at a time worked well.</p>
<p>Because of the high degree of parallelism permitted, a full ingestion of several years' worth of data takes seconds rather than the hours it would likely take if using a Synapse pipeline with an HTTP connector.</p>
<p>However, we had an additional consideration for this source: an IP allow-list on the service. In Synapse pipelines, this is dealt with using a Self Hosted Integration Runtime (SHIR) to allow you to ingest data from a well-known IP address. Fortunately, we already had the necessary infrastructure in place in our Azure Container App Environment, but this is something to be wary of when introducing additional moving parts into a solution.</p>
<h2 id="conclusion">Conclusion</h2>
<p>The queue-of-work pattern transformed what initially seemed like an intractable data ingestion problem into a manageable, scalable solution. By decoupling work distribution from execution through Azure Storage Queues, we achieved the performance and reliability goals that seemed out of reach with traditional orchestration approaches.</p>
<p>The pattern has run in production since mid-2024, ingesting large volumes of records weekly from multiple source systems. It's proven robust enough that ingestion failures are now rare, and when they do occur, the poison queue pattern makes debugging straightforward.</p>
<p><strong>Key takeaways from this implementation:</strong></p>
<ol>
<li><strong>Plan for the worst case</strong>: In a perfect world, we could potentially have lived with the sequential ingestion, as we'd only need to do the full ingestion once and then keep everything in sync. However, there are plenty of error scenarios, and these are what you need to make sure your process can handle.</li>
<li><strong>Simple is fast</strong>: The basic queue-and-worker model introduces minimal overhead (&lt;50ms per operation) compared to complex orchestration systems</li>
<li><strong>Built-in resilience is valuable</strong>: Azure Storage Queues' native retry semantics and dequeue counting eliminated the need for custom retry logic</li>
<li><strong>Horizontal scaling works</strong>: Adding workers linearly improves throughput without coordination overhead</li>
<li><strong>Progressive completion aids debugging</strong>: Seeing results immediately rather than waiting hours made development and testing significantly faster</li>
<li><strong>Determining batch sizes and concurrency limits is an experimental process</strong>: It will likely take time to determine the limits of the API you are using and there will be a number of factors to take into account. If you have a sandbox environment for the API, it will likely have different characteristics to production. And even if you find the limits of concurrency that an API can support, the owners of that API will likely not want you to push it to the limit as this could negatively impact other users.</li>
</ol>
<p><strong>When this pattern works well:</strong></p>
<p>This approach excels when you can break work into independent units that don't depend on each other's results. The API ingestion scenario is ideal - fetching assets with IDs 1000-5999 has no dependency on fetching assets 6000-11999. If your workload requires sequential processing or complex dependencies between tasks, this pattern may not be the best fit.</p>
<p>If you've made it this far, thanks for reading! If you've got questions about implementing this pattern or would like to discuss your specific use case, feel free to leave a comment below. And as a final reminder, sample code to go with this blog post <a href="https://github.com/endjin/python-queue-of-work-pattern-demo">can be found in GitHub</a>.</p>]]></content:encoded>
    </item>
    <item>
      <title>Rx.NET v7 and Futures On .NET Live talk and demos</title>
      <description>In a recent On .NET Live stream, Ian Griffiths talked about recent developments in Rx.NET and plans for v7 and future versions. This post explains where to find the demo code for that session.</description>
      <link>https://endjin.com/blog/rx7-ondotnet-live-demos</link>
      <guid isPermaLink="true">https://endjin.com/blog/rx7-ondotnet-live-demos</guid>
      <pubDate>Fri, 27 Feb 2026 06:30:00 GMT</pubDate>
      <category>C#</category>
      <category>CSharp</category>
      <category>.NET</category>
      <category>dotnet</category>
      <category>Rx</category>
      <category>Rx.NET</category>
      <category>Rx7</category>
      <category>Reactive</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/02/rx-dotnet-v7-ondotnet-live-demos.png" />
      <dc:creator>Ian Griffiths</dc:creator>
      <content:encoded><![CDATA[<p>The <a href="https://dot.net/live">On .NET Live</a> team recently had me on as a guest to talk about Rx.NET v7 and future plans for Rx. You can see the <a href="https://endjin.com/what-we-think/talks/reactive-extensions-for-dotnet-rxdotnet-v7-and-futures">talk</a> here:</p>
<p></p><div class="responsive-video pull-wide"><iframe src="https://www.youtube.com/embed/8OAvFiczZ2k" frameborder="0" allowfullscreen=""></iframe></div><p></p>
<p>I showed some demos during the talk, and there was a request in the chat to make the source available. You can find the demos here:</p>
<p><a href="https://github.com/endjin/rx-ondotnetlive-demos-2026">https://github.com/endjin/rx-ondotnetlive-demos-2026</a></p>
<h2 id="ais.net-rx.net-and-wpf">AIS.NET, Rx.NET and WPF</h2>
<p>The <a href="https://www.youtube.com/watch?v=8OAvFiczZ2k&amp;t=296s">first demo</a> (source code at <a href="https://github.com/endjin/rx-ondotnetlive-demos-2026/blob/main/src/WpfAis"><code>src/WpfAis</code></a>) presents a map showing the current position of ships around the Norwegian coast:</p>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/02/aisrxwpf.png" alt="Windows application showing a map of Norway with various coloured arrows on it indicating the location of ships, each labelled with the ship's name" srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2026/02/aisrxwpf.png 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2026/02/aisrxwpf.png 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2026/02/aisrxwpf.png 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2026/02/aisrxwpf.png 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<p>This uses a data source provided by the Norwegian government. They monitor <a href="https://en.wikipedia.org/wiki/Automatic_identification_system">AIS</a> data broadcast by ships around their coast, and make these messages available on a public endpoint. Endjin provides an open source suite of libraries collectively called <a href="https://github.com/ais-dotnet">Ais.Net</a> that can be used to process data of this kind. This example uses the <a href="https://www.nuget.org/packages/Ais.Net/">Ais.Net NuGet package</a> to process the raw messages. It also uses the <a href="https://www.nuget.org/packages/Ais.Net.Receiver/">Ais.Net.Receiver package</a> to retrieve messages from the Norwegian government's service, and to present those through Rx.NET.</p>
<div class="aside"><p>Endjin maintains AIS.NET. I last wrote about it in a recent blog post about the <a href="https://endjin.com/blog/how-dotnet-10-boosted-ais-dotnet-performance-by-7-percent-for-free">performance improvements .NET 10 has brought to this library</a>.)</p>
</div>
<p>This example shows how Rx.NET can be used to process live data streams declaratively. In particular, this effectively performs a 'join' over two kinds of message. Ships broadcast their names and types in different messages from the ones they use to report their locations (because their name and type tends to change a lot less often than their location). But the UI wants to combine this information so that it can display the ship's name and type over its location icon in the map.</p>
<p>The demo's viewmodel <a href="https://github.com/endjin/rx-ondotnetlive-demos-2026/blob/4707e72bab9ed6380499b4007895bb66b09a19d2/src/WpfAis/WpfAis/MapViewModel.cs#L30-L40">expresses this declaratively</a> in Rx.NET:</p>
<pre><code class="language-cs">IObservable&lt;IGroupedObservable&lt;uint, IAisMessage&gt;&gt; byVessel =
    receiverHost.Messages.GroupBy(m =&gt; m.Mmsi);
var vesselNavigationWithNameStream =
    from perVesselMessage in byVessel
    let vesselNavigationUpdates = perVesselMessage.OfType&lt;IVesselNavigation&gt;()
    let vesselNames = perVesselMessage.OfType&lt;IVesselName&gt;()
    let shipTypes = perVesselMessage.OfType&lt;IShipType&gt;()
    let vesselLocationsWithNames = Observable.CombineLatest(vesselNavigationUpdates, vesselNames, shipTypes,
        (navigation, name, type) =&gt; (navigation, name, type))
    from vesselLocationAndName in vesselLocationsWithNames
    select (mmsi: perVesselMessage.Key, vesselLocationAndName.name, vesselLocationAndName.navigation, vesselLocationAndName.type);
</code></pre>
<p>This uses the C# <a href="https://learn.microsoft.com/en-us/dotnet/csharp/linq/get-started/query-expression-basics">query expression syntax</a> to describe the processing we require. The Rx.NET library does all the actual work for us. (If you'd like more information about how this works, I've shown a version of this query before at <a href="https://youtu.be/K5A3uP75XNQ?t=1154">this talk</a>.)</p>
<h2 id="system.linq.async.net-10-and-system.linq.asyncenumerable">System.Linq.Async, .NET 10, and System.Linq.AsyncEnumerable</h2>
<p>During the show, I <a href="https://youtu.be/8OAvFiczZ2k?list=PLdo4fOcmZ0oV2fcY7wsQHx4RNWXEDKgm4&amp;t=1467">talked about</a> how for many years, the de facto implementation of LINQ for <code>IAsyncEnumerable&lt;T&gt;</code> <a href="https://www.nuget.org/packages/System.Linq.Async"><code>System.Linq.Async</code></a> was not, despite how the name makes it look, an officially supported library. It has always lived in the Rx.NET repo, and when we at endjin took over maintenance of Rx.NET, that meant we also became responsible for LINQ to <code>IAsyncEnumerable&lt;T&gt;</code>! (See the <a href="https://www.youtube.com/watch?v=Ktl8K2b1-WU">announcement video for the old <code>System.Linq.Async</code> package</a> for more information on the history behind this.)</p>
<p>With .NET 10, this is now finally built into the .NET runtime libraries. The <a href="https://www.nuget.org/packages/System.Linq.AsyncEnumerable"><code>System.Linq.AsyncEnumerable</code></a> package (which <em>is</em> officially supported by Microsoft) is included as part of .NET 10.0 but is also available for use on older runtimes.</p>
<p>In the demo I showed how this creates a potential problem for projects already using <a href="https://www.nuget.org/packages/System.Linq.Async"><code>System.Linq.Async</code></a>. The project at <a href="https://github.com/endjin/rx-ondotnetlive-demos-2026/blob/main/src/SysLinqDemo"><code>src/SysLinqDemo</code></a> targets .NET 8.0 and uses <a href="https://www.nuget.org/packages/System.Linq.Async"><code>System.Linq.Async</code></a> v6. If you upgrade the project to use .NET 10.0, you'll get errors about ambiguous definitions of the LINQ operators:</p>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/02/whereambiguous.png" alt="Visual Studio showing part of an error message popup indicating that the Where method call is ambiguous in this example" srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2026/02/whereambiguous.png 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2026/02/whereambiguous.png 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2026/02/whereambiguous.png 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2026/02/whereambiguous.png 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<p>I showed that simply removing the reference to <code>System.Linq.Async</code> is the simplest way to resolve this. But I also discussed how you might not be able to do that because you might be using some other library that has a dependency on it. So I also showed how upgrading to the latest version of <code>System.Linq.Async</code> (v7) also fixes this problem.</p>
<h2 id="rx-7-and-ui-framework-support">Rx 7 and UI framework support</h2>
<p><a href="https://youtu.be/8OAvFiczZ2k?list=PLdo4fOcmZ0oV2fcY7wsQHx4RNWXEDKgm4&amp;t=2265">I discussed how our main goal with Rx 7</a> is to fix the bloat problems that have for many years afflicted applications that use Rx.NET in conjunction with self-contained deployment when targetting Windows. This table shows the impact this problem can have on a simple 'hello world' console app:</p>
<table>
<thead>
<tr>
<th>Deployment type</th>
<th>Size without Rx</th>
<th>Size with Rx</th>
</tr>
</thead>
<tbody>
<tr>
<td>Framework-dependent</td>
<td>20.8MB</td>
<td>22.5MB</td>
</tr>
<tr>
<td>Self-contained</td>
<td>90.8MB</td>
<td><strong>182MB</strong></td>
</tr>
<tr>
<td>Self-contained trimmed</td>
<td>18.3MB</td>
<td>65.7MB</td>
</tr>
<tr>
<td>Native AOT</td>
<td>5.9MB</td>
<td>17.4MB</td>
</tr>
</tbody>
</table>
<p>For framework-dependent deployment, in which the .NET runtime is presumed already to be available, adding Rx.NET has a relatively small impact. But for any of the other options, a reference to <code>System.Reactive</code> can double or even triple the size of the deployment! This happens when you target a Windows-specific TFM because <code>System.Reactive</code> ends up forcing your project to depend on both WPF and Windows Forms. You end up deploying a copy of both of those frameworks, which is what's taking up all the space here. The only way we can fix this is to split out UI framework support from the main <code>System.Reactive</code> package, putting these features into more specialized packages.</p>
<p>The effect of this change is that for the four deployment models shown, the impact of adding Rx becomes roughly 1.6MB, 1.6MB, unmeasurably small, and 300KB respectively.</p>
<p>To demonstrate what this will look like for developers, I showed a simple WPF application (in <a href="https://github.com/endjin/rx-ondotnetlive-demos-2026/blob/main/src/WpfWithRx"><code>src/WpfWithRx</code></a>) that uses Rx.NET v6, and which relies on its <code>ObserveOnDispatcher</code> method to ensure that Rx notifications are handled on a suitable thread for performing UI updates. I then upgraded the project to the preview of Rx 7 to show what developers will see when they do this:</p>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/02/rx7wpfdiagnostic.png" alt="Source code with ObserveOnDispatcher showing a squiggly line and a tooltip showing two messages. One, a CS1061 error, indicates that this method is not available. The second, an RXNET0002 diagnostic, explains that this methods has moved, and that a reference to the System.Reactive.Wpf NuGet package is now required" srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2026/02/rx7wpfdiagnostic.png 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2026/02/rx7wpfdiagnostic.png 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2026/02/rx7wpfdiagnostic.png 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2026/02/rx7wpfdiagnostic.png 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<p>This illustrates that they will now get a build failure on code that expects UI framework support to be in the main Rx.NET package. But it also shows that Rx.NET v7 has an analyzer that detects this, and explains exactly how to resolve the problem. We hope this will make the transition relatively smooth for projects affected by this breaking change. (We also maintain binary compatibility—although the WPF and Windows Forms features have been removed from the public-facing API, they are actually still present in the runtime binaries.)</p>
<p>We've done this by writing a <a href="https://learn.microsoft.com/en-us/dotnet/csharp/roslyn-sdk/tutorials/how-to-write-csharp-analyzer-code-fix">custom analyzer</a>. (This is in the <a href="https://github.com/dotnet/reactive/tree/687a7e865bad90991859e876ba1bc67b0cd3923b/Rx.NET/Source/src/System.Reactive.Analyzers"><code>Rx.NET/Source/src/System.Reactive.Analyzers</code> folder</a>.) This is an extra DLL in the <code>System.Reactive</code> NuGet package that gets loaded by the IDE (Visual Studio, VS Code, or Rider will all find it). Analyzers inspect source code to find problems, and make suggestions for how to fix them.</p>
<p>(Ideally we would have supplied a Code Fix as well as an analyzer. The .NET SDK's analyzer mechanisms allow analyzers to propose code changes, which show up as 'fixes' suggested by the IDE. However, there isn't currently a good way for a fix to suggest changes to the <code>.csproj</code> file, which is what's required in this case. It seems that the only supported API for making the changes the fix would need to make is the old <code>EnvDTE</code> automation APIs offered by Visual Studio. However, that isn't available on other IDEs, so it wouldn't work on VS Code or Rider. And even in Visual Studio, there isn't a supported way for a code fix to get hold of that API. This surprised us a little, because these IDEs do actually make suggestions for adding NuGet package references in some other situations, but as far as we can see, there's no way for our analyzer/code fix to trigger that.)</p>
<p>Now that we've got an analyzer DLL built into <code>System.Reactive</code>, it would also be possible to start adding other analyzer rules: perhaps we could spot problematic coding patterns. If you have any ideas for common Rx issues that you think an analyzer could detect, please suggest them in <a href="https://github.com/dotnet/reactive/issues">https://github.com/dotnet/reactive/issues</a></p>]]></content:encoded>
    </item>
    <item>
      <title>Reactive Extensions for .NET - Rx.NET v7 and Futures</title>
      <description>&lt;p&gt;Ian Griffiths, Technical Fellow at endjin, .NET MVP, and author of Programming C# (O'Reilly), returns to On .NET Live to demo Rx.NET with live ship-tracking data from Norway's AIS network and walk through the major changes coming to the Reactive Extensions ecosystem in v7.&lt;/p&gt;
&lt;p&gt;In this episode:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;🚢 Live demo — streaming real-time vessel data with Rx.NET and AIS.NET, using LINQ queries over observable sequences to join, group, and display ship positions on a WPF map&lt;/li&gt;
&lt;li&gt;📦 System.Linq.Async → System.Linq.AsyncEnumerable in .NET 10 — how LINQ for IAsyncEnumerable moved from the Rx repo into the .NET runtime, and what that means for your projects&lt;/li&gt;
&lt;li&gt;⚠️ Rx 7 Preview — unbundling WPF and Windows Forms support from System.Reactive to eliminate the 90MB binary bloat in self-contained deployments&lt;/li&gt;
&lt;li&gt;🔮 Rx 8 and beyond — plans for browser WASM and Unity support, improved trimability, and the path to production-ready Async Rx&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="links-resources"&gt;Links &amp;amp; resources:&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/dotnet/reactive"&gt;Rx .NET repo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://introtorx.com"&gt;FREE Ebook - Introduction to Rx .NET&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/dotnet/reactive/blob/main/Rx.NET/Documentation/adr/0005-package-split.md"&gt;Rx .NET 7 Package Split ADR&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learn.microsoft.com/en-us/dotnet/core/compatibility/core-libraries/10.0/asyncenumerable"&gt;System.Linq.AsyncEnumerable in .NET 10&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/ais-dotnet/"&gt;AIS.NET&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://endjin.com/blog/how-dotnet-10-boosted-ais-dotnet-performance-by-7-percent-for-free"&gt;How .NET 10.0 boosted AIS.NET performance by 7%&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://endjin.com/blog/"&gt;endjin blog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.reactiveui.net/"&gt;ReactiveUI&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
      <link>https://endjin.com/what-we-think/talks/reactive-extensions-for-dotnet-rxdotnet-v7-and-futures</link>
      <guid isPermaLink="true">https://endjin.com/what-we-think/talks/reactive-extensions-for-dotnet-rxdotnet-v7-and-futures</guid>
      <pubDate>Wed, 25 Feb 2026 06:30:00 GMT</pubDate>
      <category>Reactive Extensions</category>
      <category>dotnet</category>
      <category>Rx.NET</category>
      <category>Open Source</category>
      <category>.NET Foundation</category>
      <category>Talk</category>
      <enclosure length="0" type="image/jpeg" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/talks/rx-dotnet-v7-and-futures.jpg" />
      <dc:creator>Ian Griffiths</dc:creator>
      <content:encoded><![CDATA[<p>Ian Griffiths, Technical Fellow at endjin, .NET MVP, and author of Programming C# (O'Reilly), returns to On .NET Live to demo Rx.NET with live ship-tracking data from Norway's AIS network and walk through the major changes coming to the Reactive Extensions ecosystem in v7.</p>
<p>In this episode:</p>
<ul>
<li>🚢 Live demo — streaming real-time vessel data with Rx.NET and AIS.NET, using LINQ queries over observable sequences to join, group, and display ship positions on a WPF map</li>
<li>📦 System.Linq.Async → System.Linq.AsyncEnumerable in .NET 10 — how LINQ for IAsyncEnumerable moved from the Rx repo into the .NET runtime, and what that means for your projects</li>
<li>⚠️ Rx 7 Preview — unbundling WPF and Windows Forms support from System.Reactive to eliminate the 90MB binary bloat in self-contained deployments</li>
<li>🔮 Rx 8 and beyond — plans for browser WASM and Unity support, improved trimability, and the path to production-ready Async Rx</li>
</ul>
<h2 id="links-resources">Links &amp; resources:</h2>
<ul>
<li><a href="https://github.com/dotnet/reactive">Rx .NET repo</a></li>
<li><a href="https://introtorx.com/">FREE Ebook - Introduction to Rx .NET</a></li>
<li><a href="https://github.com/dotnet/reactive/blob/main/Rx.NET/Documentation/adr/0005-package-split.md">Rx .NET 7 Package Split ADR</a></li>
<li><a href="https://learn.microsoft.com/en-us/dotnet/core/compatibility/core-libraries/10.0/asyncenumerable">System.Linq.AsyncEnumerable in .NET 10</a></li>
<li><a href="https://github.com/ais-dotnet/">AIS.NET</a></li>
<li><a href="https://endjin.com/blog/how-dotnet-10-boosted-ais-dotnet-performance-by-7-percent-for-free">How .NET 10.0 boosted AIS.NET performance by 7%</a></li>
<li><a href="https://endjin.com/blog/">endjin blog</a></li>
<li><a href="https://www.reactiveui.net/">ReactiveUI</a></li>
</ul>
<p><a href="https://www.youtube.com/watch?v=8OAvFiczZ2k?t=187"><img src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/talks/rx-dotnet-v7-and-futures.jpg"></a></p><p><strong>Katie Savage:</strong> Hello everybody and welcome back to On .NET Live, where it's our mission to teach the .NET community to achieve more. This morning, or afternoon, or evening depending on where you are, we have an awesome show prepared for you. I'm one of your hosts — this is Katie Savage, here with Cam Soper and Frank Boucher. And I'm sorry, Frank, for my horrendous French accent. I don't have one. It's nonexistent. But I'm super excited to introduce our guest today, Ian Griffiths, who is actually a returning guest. Ian, could you tell us a little bit about yourself?</p>
<p><strong>Ian Griffiths:</strong> So hi, I'm Ian Griffiths. I am an author with O'Reilly — I've written the last four or five editions of <em>Programming C#</em>. I've also long been a Pluralsight instructor. Going way back, I got started in computing doing kernel mode device drivers and embedded systems, and then I've gradually been working my way up the stack since then — through medical imaging, broadcast video systems, and then into UI stuff. And then more recently into data analytics and of course applications of that to AI these days.</p>
<p>I currently work as a Technical Fellow for endjin, who are the sponsors of the Reactive Extensions for .NET these days. So the reason Rx.NET, which we're going to talk about today, is still alive and ticking is the generosity of my employers. So thank you very much to them. And that's me.</p>
<p><strong>Katie:</strong> Awesome, thank you so much for taking us through that. And you gave us a bit of a spoiler, but I'd love to hear more about what you're talking about today. And I believe last time you talked a little bit about this topic as well.</p>
<p><strong>Ian:</strong> I was last on I think two and a half years ago, and at that point endjin had just taken over the maintenance of the Reactive Extensions for .NET. So Reactive Extensions, or Rx.NET for short — also known by the NuGet package name System.Reactive — these are one of the oldest open source projects actually in the Microsoft world. It originally came out of Microsoft.</p>
<p>The Reactive Framework was originally created by the Cloud Programmability Group inside of Microsoft back in around 2008. And it's the same people behind it as were behind LINQ — Language Integrated Query. It kind of came out of that. So it was Erik Meijer's team, essentially, that came up with this back in the day. Same people invented LINQ, same people — some of the same people who are behind the async language features for C# as well. So a very interesting team, and they created this thing called Rx.NET.</p>
<p>And the way I like to describe it is: Rx is useful in any program where things happen. So that's quite a broad category, although not everything — there are exceptions to this. Programs that basically reach into a database where the data's just sat there and do some processing and then write some results out — nothing really happens there. There's sort of data in and data out, and it's all like a batch process. Whereas applications where things are happening live tend to require a slightly different approach.</p>
<p>So Rx has been most well known, I guess, in the user interface world, because there things happen — the user interacts with the application and you need things to happen in response to that, and Rx is really good for that.</p>
<p>But it can also be used in things like monitoring applications. So if you imagine, for example, utility companies — one of the projects that endjin has done with this is we worked with a utility company that provided broadband services, and we were modelling all of the diagnostic data from their multiple millions of routers in people's homes that were reporting information about the state of the network, to provide analytics that were live so that they could, for example, see problems that were unfolding in their network before the customers were troubled by it.</p>
<p>And so you needed to be able to monitor literally millions of devices and to get analytics — some of which were kind of specific to the devices, some of which were at a more aggregate level. So you wanted to know: has this connection gone down because someone's just accidentally watering their plants and has drowned their router, or is it because the exchange is on fire? Those are two quite different conditions and the appropriate response for those two things is different.</p>
<p>And so the sort of analysis you want to do in real time in response to this changing data — it's very useful to be able to do stream analytics at various different levels of detail. And that's the kind of thing that Rx is also very good at. It's less obvious than perhaps the user interface type approaches, but it's equally valuable in that kind of live data analytics world.</p>
<p><strong>Katie:</strong> I can imagine, and I'm super intrigued already by the scenarios you've brought up. And honestly that tagline is perfect — I would love to put that on my LinkedIn bio: "useful on any project where something happens." That's amazing. But I'd love to learn more about Rx and I'd love to have you get into it.</p>
<p><strong>Ian:</strong> Okay, well I think possibly the best thing to start with would be a demo. So I've got a little WPF application here. Now those of you who did watch the same talk two and a half years ago would've seen an earlier iteration of this — let me get that on the right screen.</p>
<p>So just to show you what the thing does before I get into the code: this is obviously a map control, and as this runs for a while you'll see gradually appearing on the map these little markers. And what we're seeing here is actually live information, generously provided by the Norwegian government, basically for free. They provide this online service that reports the location and movement of all — basically all — ships anywhere near Norwegian waters.</p>
<p>So there's this standard called AIS, the Automatic Identification System, which basically any large or moderately sized vessel that operates in international waters is legally obliged to have. It's basically GPS plus a radio transmitter, so the ships can report where they are. So if you have a marine GPS system, you can see where the other vessels are, and that's because they're all broadcasting their information.</p>
<p>Now this is all the ships in Norway, so this gets quite busy quite quickly. But you can see if I zoom in, you can see it showing where the things are, the direction they're heading in. The API also offers things like speed. It tells you what kind of vessel they are. It will tell you whether they're moored or whether they're anchored or whether they're moving or whether they're doing diving operations. There's quite a lot of information there. This is obviously live, so these things will gradually move around. You saw it populate as I started to go, so this is kind of a good example of the sort of data that we might actually want to deal with in Rx.</p>
<p>So just to give you a flavour of how this looks, I'm going to kill it off and we can start to look at the code. And actually what I'll start with — I've got a couple of projects here — I've got a simpler console app that lets us just see how the API works at its most basic level.</p>
<p>So if I go and open up the Program.cs here: we have this API that lets us connect to servers that can provide AIS data. So that IP address happens to be the IP address of the service provided by the Norwegian government. If you connect to them on this port, they will give you AIS data.</p>
<p>And this library we're using here is a library called Ais.NET, also written and maintained by my employer, endjin — thanks again. So this is a free library that lets you process AIS messages and it exposes them through this property here, which is of type <code>IObservable&lt;AisMessage&gt;</code>.</p>
<p>So here, this is where Rx comes into the picture. This interface, <code>IObservable&lt;T&gt;</code> — it's actually built into the .NET runtime class library. So you don't need any extra libraries to have this. This has been built in since .NET Framework version 4.0. This was actually baked right into the framework.</p>
<p>All of the other support around it isn't, but that interface is baked in. This is the core of Rx. It represents a sequence of things, and in this case the things are AIS messages. So this basically says: if you've got a receiver host plugged into an AIS data source, it will give you a sequence of messages.</p>
<p>And right now this program is incredibly simple. It says: I would like to subscribe to that source of messages, and each time a message arrives, I want it to invoke this function, which just says "okay, what kind of message was it? Was it one that tells us the vessel's name? Does it tell us where the vessel is?" and prints out the details.</p>
<p>So if I run this one — and get that onto the right screen, let's make this a little bit bigger — we can see what the raw messages look like. I'm going to stop that and take a look at it, because this is actually going to show you a challenge with this data source that we're going to use Rx to address.</p>
<p>So Rx isn't just about receiving sequences of messages. If that was all it did, it wouldn't be very interesting. But it allows us to perform processing on those messages in a declarative way. So let's look at these messages.</p>
<p>So this is basically the raw data — or some of the raw data — being received from the ships, and they send different types of messages. So sometimes they send out a message saying "I am at this location and I am facing in" — that's not really a compass bearing, because compasses don't go up to 511. That's a magic value saying "I don't know which way I'm facing right now, so please ignore this," and "I'm not moving." Some of them will have more interesting information.</p>
<p>So there we go — that looks like a genuine compass bearing. So this one says: "I am currently engaged in fishing. I'm currently at these GPS coordinate locations, and I'm facing in this direction." And since I'm fishing, I'm not currently moving. So you can get this information about what vessels are doing, and they're all tagged with a unique vessel identifier.</p>
<p>So when you install one of these GPS systems, they have a unique vessel ID. Now some of these things say what the name of the boat is, so you can see this one here is saying "my name is Cans 21." And a lot of spaces — sometimes they seem to pad the space out with at symbols for some reason. Don't know why, but they do.</p>
<p>Now on the map, I wanted to label each of the nodes to say "the vessel here is called this, the vessel there is called that." But there's a slight problem, because the messages that tell us where they are don't include the name, and the messages that include the name don't say where they are.</p>
<p>And the reason for this is that boats tend to move more often than they change their names. And so they do not broadcast their names as often as they broadcast their locations. And because this whole radio standard is actually using relatively low frequency radio, the bandwidth is minuscule — there's maybe hundreds of bits per second. It's incredibly low bandwidth, and so they don't have a lot of space in these messages. And so they try and maximise efficiency.</p>
<p>But then, okay — if I'm going to draw these things on a map, when I receive a message that says "we're at this location doing this thing in this direction," how am I going to reconcile that with the name that I'd like to stick on the label?</p>
<p>Essentially what I want to do conceptually is a join. It's like if this information was in a couple of database tables — I had a table of "here are all the names" with the primary key of the vessel ID, and another table saying "here are the locations" also with the key of the vessel ID — I would just join across those and then I'd be able to get the answer.</p>
<p>That's fine if the data's already there, if nothing's happening, if it's not live, if it's just data that's sat there that I can query. But how am I going to do this when it's live data?</p>
<p>Well, this is where Rx comes in. So if I go back to the WPF version, which is actually labelling these things, I've got the same basic code. I've got a view model here that's powering the display — this is basically sitting behind the UI that you see. And as with the console app, again I just create myself a receiver and a host for that receiver. This is the thing that gives me observable messages.</p>
<p>But now I'm doing some more interesting things. I am saying I want to not just process the messages — I want to start running processing operations on them. In this case, I would like to group the vessels by their unique ID. So what this says is: rather than having a single stream, I want to get an observable stream of observable streams. So each time this sees a vessel it hasn't seen before — each time it sees a message where the vessel identifier is not the same as one it saw last time — it's going to emit a new group as the item that comes out of this observable.</p>
<p>So this is an emitter of groups. It's a sequence of sequences. And then what I can do is say, okay, within each stream I'd like to pick out the different types of messages. I'd like to pick out the navigation messages and the messages that say what the name is, and also the ones that say what kind of ship this is — is it a fishing vessel? Is it a tug? Is it a tanker? Is it something else?</p>
<p>And then combine those together. So I'm sort of doing a join here. In essence, I'm not actually using join syntax, but logically it's doing that sort of thing. I'm telling the Rx library I would like to combine these streams together to find the latest location, name, and type information within this single vessel stream. So basically this is going to emit a series of: this vessel is called this name, it's at this location, and it's this type. And every time any one of those things changes — if it changes location or changes speed, or if less likely it changes its name or its type — then I'll get a new message comes out of here.</p>
<p>And then finally, I basically merge them all back together again. This actually, because it's got two <code>from</code> clauses in here, turns into a LINQ <code>SelectMany</code>, which is a flattening thing. It flattens it back down again. And the net result is the messages that actually come out basically tell you the combined vessel name, location, and type for each vessel.</p>
<p>So anytime you get new information about a vessel, it comes out of here, and this is what we can then use to update the UI. So now I'm just moving into the world of WPF — I'm picking a colour to paint the ship based on its type, I am setting its location on the map based on the reported location, and I'm labelling it with the name.</p>
<p>So we're now back in basically the world of WPF data binding at this point. The point is I can use C#'s built-in query syntax — or if you prefer you can just write LINQ queries as method invocations. Some people prefer one style, some people prefer the other. They are exactly equivalent. You can write it either way, and you can essentially execute queries over live data streams.</p>
<p>So when we run — hang on, I've restarted the console app. Let me go back to the WPF app, click the right button. So when this runs, it starts receiving messages. Initially it's going "okay, well I've got a location but I don't have the name yet," or "maybe I've got a name but I haven't got a location." But once it starts seeing messages of all types from a single vessel, it goes "oh, okay — right now I've seen a location and a name and a type for this particular one here. Now I can actually emit that as a single message with all three." Hand that over to the map, and off we go.</p>
<p>And if we sat here long enough, you would gradually see them moving across the screen. Although, being ships, they're going to move quite slowly at this scale, so I'm not going to sit and make you wait for that.</p>
<p>The basic idea then is we've got this abstraction — observable — which we can do all the same things with that we might do with an <code>IEnumerable</code>, because they are essentially the same fundamental idea. An <code>IEnumerable</code> is just one thing after another. An observable is just one thing after another. The difference is: with an <code>IEnumerable</code>, we as the programmer say "I'd like the next item please." You write a <code>foreach</code> loop over the thing — "give me the next item, you do some work, give me the next item, you do some work."</p>
<p>So we as the developer are pulling items out of the source, so to speak. Whereas with Rx, the source decides when it has something for you. I can't walk up to this API and say "make that ship over there emit a message for me." That's not going to happen. The ship's transmitter will emit messages when it wants to.</p>
<p>And so that happens on its own schedule. I as a developer am not in control of that. And so Rx gives me a way of expressing that by modelling these things as sources that emit. So we have what's called a push style of consuming them, where the source delivers messages to us. So that's the fundamental concept.</p>
<p>It's designed specifically to be very similar to <code>IEnumerable</code>. It's just that you receive messages when the source has them for you, rather than retrieving them when you are ready to process the next thing. That's basically the heart of Rx.</p>
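<p>As a minimal illustration of the pull versus push distinction (assuming a console app with a reference to the System.Reactive package):</p>
<pre><code class="language-csharp">using System;
using System.Collections.Generic;
using System.Reactive.Linq;

class PullVersusPush
{
    static void Main()
    {
        // Pull: we ask for each item when we are ready for it.
        IEnumerable&lt;int&gt; numbers = new[] { 1, 2, 3 };
        foreach (int n in numbers)                  // "give me the next item"
        {
            Console.WriteLine($"Pulled {n}");
        }

        // Push: the source decides when to deliver items to us.
        IObservable&lt;long&gt; ticks = Observable.Interval(TimeSpan.FromSeconds(1));
        using IDisposable subscription = ticks.Subscribe(
            t =&gt; Console.WriteLine($"Pushed {t}")); // invoked whenever the source has a value

        Console.ReadLine();                         // keep the app alive while items arrive
    }
}
</code></pre>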
<p>And then the same LINQ query language is basically available on both — pretty much anything you can do with LINQ to Objects like searching, filtering, joining, windowing, all these sorts of things are available. And actually, as it happens, Rx provides a bunch of extra operators that are specifically temporal in nature, that wouldn't really make sense for a database. So you can, for example, say "I would like a sliding window that is two seconds long and I'd like you to give me all the events that happen within a two-second window." So I can process those that way. And that obviously only makes sense in the presence of timing. And this being a push-oriented thing, timing is inherently there in a way that it's not with a raw database.</p>
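<p>A small sketch of the time-based operators being described, again assuming the System.Reactive package:</p>
<pre><code class="language-csharp">using System;
using System.Reactive.Linq;

class TimeWindows
{
    static void Main()
    {
        IObservable&lt;long&gt; source = Observable.Interval(TimeSpan.FromMilliseconds(300));

        // Tumbling two-second windows: each event lands in exactly one batch.
        source.Buffer(TimeSpan.FromSeconds(2))
              .Subscribe(batch =&gt; Console.WriteLine($"2s batch: {string.Join(", ", batch)}"));

        // Sliding windows: a new two-second window opens every second, so windows overlap.
        source.Buffer(TimeSpan.FromSeconds(2), TimeSpan.FromSeconds(1))
              .Subscribe(batch =&gt; Console.WriteLine($"sliding:  {string.Join(", ", batch)}"));

        Console.ReadLine();
    }
}
</code></pre>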
<p>So that's the heart of Rx.</p>
<p><strong>Katie:</strong> It is super cool. When you started initially it was like "okay, we can buffer the messages and stuff" — no, just group them and it works. That's awesome.</p>
<p><strong>Ian:</strong> It's a declarative style. So rather than you having to think "what code am I going to write to process it, where am I going to put these things, how do I bunch them together?" — if you can express the semantics of what you are doing through the language of LINQ, through the standard query operators that LINQ provides, then you don't have to think about how you're doing it. You can just say what you would like done.</p>
<p><strong>Katie:</strong> That's pretty cool. Super cool. And Ian, one person here, John, is asking if there's a link for this code. They already want to start practising with this.</p>
<p><strong>Ian:</strong> Oh right. So I wrote this like a couple of hours ago, so no, not yet. There is a notebook you can get hold of — we used to do this as a Polyglot notebook. Unfortunately the Polyglot notebooks are kind of going away. I think that project is winding down now. And so I rewrote this as a plain WPF project this afternoon so I could be sure I could run it. I will endeavour to make this available. I will do a blog post about this just to follow up. So I will make this code available to people who want it. If you go to endjin.com, then you can find our blogs there. You can find me. I will make this available later this week.</p>
<p><strong>Katie:</strong> Perfect. Thank you. Good question, John.</p>
<p><strong>Ian:</strong> So what I was hoping to talk about today, if I may, is some of the stuff we've actually been doing in the Rx project. So this is kind of the intro, entry-level stuff. But we've actually been doing some things lately.</p>
<p>One of the things I talked about the last time I was on is documentation and kind of learning Rx, because one of the things with Rx is it's very powerful, but it's not the easiest thing in the world to learn. People often struggle to get their heads around it for a while and then eventually reach a kind of "aha" moment, where it's like "oh, I get it now," and suddenly you can't imagine programming without this whole mechanism. And getting people to that point has been challenging.</p>
<p>Now last time I was on, I was talking about — there was a site called IntroToRx.com, which was written actually like 14 years ago, believe it or not, by a guy called Lee Campbell. But he hadn't updated it since then, so it was very good but it was also kind of out of date. But he very generously allowed us to take that content and update it. And since the last time I was on, that is now done — up-to-date site, IntroToRx.com. So if you want to learn in detail about this stuff, that is absolutely the place to go.</p>
<p>IntroToRx.com — it's available. We also take contributions. The community is regularly submitting changes or fixes or enhancements to that, and so it's a live, up-to-date place to go to learn about this stuff. I would check that out if you are in any way interested in this.</p>
<p>So, any questions you want to raise as hosts before I — because I know I can fill the entire hour without noticing it's gone, so I don't want to completely take things over.</p>
<p><strong>Cam:</strong> Well there was — Cecil was asking — he thought that Rx and Ix operators were moving into the core framework.</p>
<p><strong>Ian:</strong> Ah yes. This is exactly one of the things I wish to talk about. So let me just — I'm going to open up a web browser. Probably didn't want to do it in the same window that I'm running the session in. Two seconds and I've lost my mouse pointer. Where's it gone? That's what happens when you have a lot of screens, people.</p>
<p><strong>Katie:</strong> I'm just jealous. I wish I had that problem.</p>
<p><strong>Ian:</strong> Okay. So this is actually the source code for the documentation, but this is up on the — if you go to the main .NET website you will — hang on, no, this is not the one I meant to do. I meant to look at the one that's on the main .NET website. Hang on a second. .NET 10. It's this one here.</p>
<p>So this was announced as a breaking change in .NET 10, which is that they have added support for LINQ to <code>IAsyncEnumerable</code>. So what does this have to do with Rx.NET? Because <code>IAsyncEnumerable</code> is this interface that was introduced to the .NET runtime round about .NET Core 3.1 time.</p>
<p>Let me just clear that out of the way to make a bit more space. Right. So the basic idea is it's like <code>IEnumerable</code> — it's a sequence of things — but it's async. So you want to produce items in a way where the implementation sometimes needs to await a task that might complete asynchronously, rather than blocking. <code>IAsyncEnumerable</code> lets you model that.</p>
<p>In this example I've written this as a C# iterator method, and then they introduced in C# 8, I think it was, this <code>await</code> flavour of the <code>foreach</code> loop, which is designed to provide direct integration for this. So this is <code>IAsyncEnumerable</code>.</p>
<p>But when it was introduced back in .NET Core 3.1, the .NET team did not provide an implementation of LINQ for this. So you could not, for example, do <code>.Where(x =&gt; x is divisible by two)</code>. For example, if we try this, we get a squiggly saying "the type arguments for this method cannot be found."</p>
<p>However, the Rx team said "that's fine, we have an implementation of this that you can use." If you go to the NuGet package manager and if you search for System.Linq.Async — this says "provide language integrated query over <code>IAsyncEnumerable</code> sequences." And I'm going to add actually version six, and I'm going to just quickly zoom in on something if this will work. You may notice that the project URL is github.com/dotnet/reactive. That's the Rx repo. Mysteriously, this implementation lives in the Rx repository.</p>
<p>If we click — that squiggly goes away. So now if I run this, this basically reads input lines and emits them as numbers. So if I type 42, that will come through. If I type 43, it won't come through, because I filtered it to say I only want the even numbers. So those come through, those ones don't.</p>
<p>So this made LINQ available for <code>IAsyncEnumerable</code>, and this was released almost immediately after .NET Core 3.1 shipped.</p>
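<p>The demo just described looks roughly like this (a sketch rather than the exact code from the stream; on .NET 10 the <code>Where</code> comes from the built-in <code>System.Linq.AsyncEnumerable</code>, and on earlier targets from the <code>System.Linq.Async</code> package):</p>
<pre><code class="language-csharp">using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

class Program
{
    // An async iterator: yields a value per line of input, awaiting as it goes.
    static async IAsyncEnumerable&lt;int&gt; ReadNumbersAsync()
    {
        while (true)
        {
            string? line = await Console.In.ReadLineAsync();
            if (line is null) yield break;
            if (int.TryParse(line, out int value)) yield return value;
        }
    }

    static async Task Main()
    {
        // LINQ over IAsyncEnumerable: keep only the even numbers.
        await foreach (int even in ReadNumbersAsync().Where(x =&gt; x % 2 == 0))
        {
            Console.WriteLine($"Even: {even}");
        }
    }
}
</code></pre>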
<p>Now, why, you might wonder, did the Rx team do this? What on earth does this have to do with Rx? Didn't I just say that Rx is all about push, whereas <code>foreach</code> is all about pull?</p>
<p>Well, here's the thing. The team, as I mentioned — the team that invented Rx, the people in it were also behind LINQ. And actually they invented <code>IAsyncEnumerable</code> first. <code>IAsyncEnumerable</code> had been around in .NET for about five years before it appeared in the .NET Core runtime itself, and it was written by the Rx team. They implemented this thing called <code>IAsyncEnumerable</code> themselves. And then later, several years later, the .NET team said "oh, that's useful, we should build that in." And they did.</p>
<p>And so the Rx team said "alright, we should probably stop trying to define it ourselves." They removed their own definition of <code>IAsyncEnumerable</code>, but it's like — hang on, we have a complete LINQ implementation for this. They'd already done it. They'd already done all the work to support LINQ for <code>IAsyncEnumerable</code> long before the .NET runtime and the C# compiler team built support for that interface in.</p>
<p>And so it was like "well, we could just make this available." And so they did. They made System.Linq.Async available as a library, because they just had it and people wanted it.</p>
<p>There was a slight problem with this, which is: by this time there wasn't really any full-time support inside of Microsoft for the Rx project. So it had started as a fully funded internal project and it turned into a fully community-supported project. And the result of this is that loads of people look at this and think "well, it's called System.Linq.Async — that must be part of the .NET runtime library. That must have full support. I must be able to demand feature enhancements and bug fixes for that, just like I could for any other bit of the .NET runtime." And those of us who are maintaining this in our own time for free were starting to get support requests from people who genuinely believed we were being paid to do this, when we weren't.</p>
<p>And when we decided to take over — to offer to take over maintenance of the Reactive Extensions for .NET — we didn't really sign up to also become the maintainers of LINQ for <code>IAsyncEnumerable</code>, but it sort of happened because it's the same repo. So we sort of became responsible for that as a direct result of taking over the repository.</p>
<p>So David Fowler actually said "this doesn't seem right, we should probably build this into the .NET runtime libraries." And he said that about two years ago, and that eventually came to fruition with .NET 10. So as of .NET 10, you don't need to do this. I can actually come in here and remove this library and this will go back to giving me an error. But if I now upgrade the project to .NET 10 and save that and build it — that squiggly should go away again. And now it's there, but now it's actually in a different place. If I mouse over this and if I zoom in again, you can see the location is <code>.NET/shared/Microsoft.NETCore.App</code> version 10. So the .NET Core runtime libraries, in a library called System.Linq.AsyncEnumerable — it's now built in.</p>
<p>And so you would think this would be just a slam dunk for us on the Rx maintenance team, but you would not believe how much work it has taken for us to step neatly out of the way of this development. Because the problem is: a lot of projects will have that reference to System.Linq.Async in them, and then they'll upgrade to .NET 10.</p>
<p>Let me show you what happens when you do that. If I install this again, if I put the library back — so now I'm on .NET 10 and I've got System.Linq.Async — and now I have a problem. It now says "the call is ambiguous between the following methods or properties." There are two implementations of LINQ for <code>IAsyncEnumerable</code>. There's the one that's built into the .NET runtime libraries, and there's the one that's in the System.Linq.Async package.</p>
<p>Now, you can fairly easily solve this by removing the reference to System.Linq.Async — unless you didn't add that reference in the first place. What if you are using some other library that depends on System.Linq.Async? Now you can't get rid of it. So that's fun.</p>
<p>So what we did is we actually released a new version of System.Linq.Async version seven. And if we update to that and now go back here, the problem goes away again. And essentially — if I mouse over again and zoom in — you can see we're back to using the one built into .NET Core, into the .NET runtime libraries.</p>
<p>So what essentially we've done is we've removed all the stuff that's now in the .NET runtime. So if you're on the latest version of our library, it no longer tries to provide you with LINQ for <code>IAsyncEnumerable</code>, because .NET does that for you.</p>
<p>However, it's more complicated than that, because what if you are using a library that was built against System.Linq.Async version six, and it's not been built against version seven? It's going to expect to find all of LINQ there in binary at runtime, because it's compiled against our DLLs and not the ones in the .NET runtime.</p>
<p>So we actually have to ship two completely different sets of binaries. If you look inside these NuGet packages, there are reference assemblies that say "oh no, we don't provide an implementation of LINQ for async anymore." But there are runtime assemblies that do continue to supply that for binary backwards compatibility. So it's all a bit hairy to make it work, but the net result is it should just work how you expect.</p>
<p>Basically, the only problem people are going to see, we hope — and so far this has panned out — the only problem we're expecting people to see is if they end up with a reference to System.Linq.Async version six and they upgrade to .NET 10. Or if someone else brings in this new runtime library, because you are allowed to add a reference to System.Linq.AsyncEnumerable even if you're on .NET 8 — they actually made it work down-level. If someone's done that to you, you end up with this ambiguous reference error. But you can fix it by just upgrading to the latest version of Ix.NET.</p>
<p>So we will eventually be deprecating the System.Linq.Async package. We're going to do that fairly soon. The only reason we haven't already done it is we wanted to make sure that this all worked for people. I had some sleepless nights when .NET 10 shipped, thinking "am I going to get a million bug reports because I've not thought of something and this is all going to go wrong?" But so far it seems to be fine.</p>
<p>So we are going to mark System.Linq.Async as deprecated so that people can stop using it eventually. But for the meantime, there's this kind of off-ramp where you just use this thing.</p>
<p>There's one other issue though, which is that there are some features we provided in System.Linq.Async that were not replicated in .NET 10. So for example, there's this slightly strange method called <code>AsAsyncEnumerable</code>, and there's an equivalent method in LINQ to Objects. The .NET runtime libraries do offer an <code>AsEnumerable</code> method — it basically says "I want to hide the concrete type of this thing and just turn it into the interface." It says "erase the type for me." I might have some <code>MyAsyncEnumerableImplementationType</code> and I'd like to treat it as an <code>IAsyncEnumerable</code> so that only the <code>IAsyncEnumerable</code> extension methods are available. It's occasionally useful to do this. And for some reason the .NET runtime library did not include that method when they did their own version of it.</p>
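<p>For anyone unfamiliar with this "erase the type" idiom, here is what the LINQ to Objects equivalent looks like; <code>AsAsyncEnumerable</code> plays the same role for <code>IAsyncEnumerable</code> (a minimal sketch):</p>
<pre><code class="language-csharp">using System;
using System.Collections.Generic;
using System.Linq;

class TypeErasure
{
    static void Demo()
    {
        List&lt;int&gt; list = new() { 1, 2, 3 };

        // Statically a List&lt;int&gt;, so List-specific members (Add, Find, ...) are available.
        list.Add(4);

        // AsEnumerable "erases" the concrete type: from here on, only the IEnumerable&lt;int&gt;
        // view of the data (and its extension methods) is in play.
        IEnumerable&lt;int&gt; sequence = list.AsEnumerable();
        Console.WriteLine(sequence.First(x =&gt; x % 2 == 0));
    }
}
</code></pre>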
<p>So we continue to provide that, but it has moved. There is a new library. If you look in your dependencies, if you have a reference to the latest System.Linq.Async, you'll see that we have transitively given you a reference to System.Interactive.Async. Now this has always existed — this was always the library where we put non-standard LINQ-like things that aren't really proper LINQ, they're sort of extensions to LINQ. They live in this library for the async version. And there's also System.Interactive, which has existed for like 12 years — it's where we put non-standard but LINQ-like operators.</p>
<p>And so we've moved everything into there. So the idea is you would stop using System.Linq.Async — you would just remove your reference to that. If you need any of the functionality that we provide and .NET 10 did not copy over, you would instead add a reference to System.Interactive.Async, and then you're good.</p>
<p>So that was done back in November, and we'll be deprecating that library fairly soon. That is my not-very-brief answer to Cecil's excellent question, because I wanted to talk about that. People need to know.</p>
<p><strong>Frank Boucher:</strong> Interesting story. I never thought about the merging stuff — the impact of what it can cause.</p>
<p><strong>Ian:</strong> Well, I have an even bigger version of that that I want to talk about as well today, which is one of the big things that we're doing for Rx version seven.</p>
<p>So we shipped Rx 6 a couple of years ago. We shipped Rx 6.1 earlier this year — it has a handful of minor new features in it: some additional operators and community contributions. But in Rx v7.0 we are trying to fix a problem I talked about two and a half years ago and that we haven't managed to fix yet. So let me talk about what this problem is.</p>
<p>I have another window open here. Let me get it onto the right monitor. Right, so — do we have a drum roll, Cam? Right, so this is a pretty simple WPF application. All I'm doing is creating an observable sequence. The nature of the sequence isn't terribly interesting — it just produces numbers at kind of randomly spaced intervals. It counts up, but it does so at a slightly lumpy speed. And then in my subscription, I'm just putting the output directly into a property of a control.</p>
<p>So if I actually run this — it just shows increasingly high numbers at slightly random intervals. Not very interesting. But the point of this is to illustrate one of the things you often have to do in user interface programs.</p>
<p>If I comment out a magic line, this will stop working. We've immediately hit an exception — <code>InvalidOperationException</code>. Let me close the live preview. The calling thread cannot access this object because a different thread owns it. If you've done much user interface programming, you will be familiar with this problem. Basically almost all user interface technologies require the UI to be updated from the right thread. So any window handle in Windows belongs to one particular thread, and most UI frameworks don't like it if you try and change something from any different thread.</p>
<p>Back in the day, Windows Forms version one used to just break weirdly when you did this — it didn't notice you'd got it wrong. It would just gradually melt its innards and would start to go wrong. Now it actually detects it and throws an exception, which is an improvement. But basically you've got to be on the right thread.</p>
<p>So in Rx we offer these helper mechanisms where you can say "okay, I do want to subscribe to this source, but actually I need to observe it on a particular context." I can't just take the raw notifications because I happen to know this source is going to deliver them to me on a thread that is not useful.</p>
<p>So if I add this <code>ObserveOnDispatcher</code>, what this says is: I know I'm in the WPF world, and so I want the dispatcher for whatever thread I'm on when this method runs — the current thread's dispatcher — to be captured. And anytime this source emits a value, I would like to basically redirect that back onto the user interface thread before I handle it.</p>
<p>So now if I run this — I'm going to stick a breakpoint here. So when it tries to raise the events, if we look at the thread that I'm on, you can see I'm on some sort of thread pool worker thread up there. So if this were to come straight through, it would not be the right thread to hit the UI. But if I now hit F5 and see — well, okay, now we've received that. What thread are we on now? Well, now we're on the main thread, because I told Rx that's what I need.</p>
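<p>The pattern being demonstrated is roughly this (a sketch rather than the actual demo code; <code>CounterText</code> is an assumed <code>TextBlock</code> defined in the window's XAML):</p>
<pre><code class="language-csharp">using System;
using System.Reactive.Linq;
using System.Windows;

public partial class MainWindow : Window
{
    public MainWindow()
    {
        InitializeComponent();

        // A source that emits on a thread pool thread, not the UI thread.
        IObservable&lt;long&gt; source = Observable.Interval(TimeSpan.FromSeconds(1));

        source
            .ObserveOnDispatcher()   // marshal notifications back onto the UI thread
            .Subscribe(value =&gt;
            {
                // Safe to touch UI elements here; without ObserveOnDispatcher this line
                // would throw, because the notification arrives on the wrong thread.
                CounterText.Text = value.ToString();
            });
    }
}
</code></pre>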
<p>So the point here is that Rx offers integration with certain UI frameworks. We do this stuff for WPF and a few other helpers. We also have ones for Windows Forms — you can do <code>ObserveOnControl</code>. We also do ones for UWP, or indeed anything that uses the Windows Core Dispatcher. So we have <code>ObserveOnCoreDispatcher</code>.</p>
<p>Here's the problem. All of this today is built into the same library. If you want this, you just use the same System.Reactive NuGet package as you do for anything else. So if I were to look at this project, the only NuGet package reference that I've got there is the standard Rx one.</p>
<p>Why is that a problem? Well, in a way it's not a problem — in a way it simplifies things. It means I just said "Rx please," and if I happen to be using WPF, then I get the WPF features. They're just right there. The problem, however, is when you start doing things like AOT (ahead of time) compilation or self-contained deployment.</p>
<p>If you want to build this WPF app into a self-contained form where you don't have to pre-install the .NET runtime to use it — this has been a mode that's increasingly well supported in recent versions of .NET — if you want that to work, then the problem is that including Rx means that it will now ship a complete copy of WPF and Windows Forms with anything you build, whether or not you're using either of those frameworks.</p>
<p>So if, for example, you are only targeting .NET 10 with the Windows-specific TFM because you happen to want to call some API that's in there — let's say maybe you're writing a console app that wants access to the sensor framework that's available in the Windows API. Maybe you want to read orientation data and report that over the network. You don't have a UI, but the problem is: because you said "I want the Windows flavour," Rx goes "oh, well then you must want WPF and Windows Forms," and so your binaries get about 90 megabytes larger as a result of this. Which is not good.</p>
<p>The reason this was missed at the time is that back when the decision to unify everything into a single package was made, there was no such thing as WPF on .NET Core. That didn't come along till .NET Core 3.1, and this decision was taken earlier than that. You only had WPF if you were doing .NET Framework — classic .NET FX — and there was no way of doing a self-contained deployment if you were building a .NET Framework app. You had to install the .NET Framework on the target machine before you could install your app. And so this whole problem didn't arise.</p>
<p>Since they made this decision, it's now become a real problem. If you are targeting a Windows-specific target framework moniker and you include Rx, and you build a self-contained deployment of any kind, you now get 90 megabytes of unwanted stuff. If you turn on trimming, it goes down to a mere 60 megabytes, which I guess is slightly better, but it's still an awful lot.</p>
<p>So we wanted to fix this, and actually I said last time I was on, two and a half years ago, I said "we're trying to fix this, we haven't worked out how to yet." We now think we have worked out how to.</p>
<p>So if you go to the NuGet package manager — if you are using Rx, there is a preview of Rx 7 available on NuGet. And if we do this, it unbundles the UI framework support. So if you say "I want System.Reactive," you just get System.Reactive. We don't give you the WPF stuff anymore.</p>
<p>Now, actually the runtime binaries still have it, because we've had to do the same binary compatibility thing to make sure that anything built against older versions of Rx that was expecting everything to be there will still work. But we no longer declare a dependency on WPF. We no longer force your application to depend on WPF. And so this gets rid of this problem.</p>
<p>Now the obvious downside of this is that if you are building against this, it will now say "well, that method doesn't exist anymore." And that's fine — you just need to add the right library.</p>
<p>But what we've done is we've added an analyser that detects when you've done this and says "oh, you are trying to use <code>ObserveOnDispatcher</code>, and that used to be built into System.Reactive. Now it isn't." So we thought, rather than confusing people and them going "why is this method gone away?" — we're actually telling them "okay, you now need to add a reference to this package for this to continue working." So people at least get told what to do as part of the upgrade.</p>
<p>So it's like "oh, okay — add a reference to System.Reactive.WPF." Let's go and find that. System.Reactive.WPF — there it is. We install that, and now because we've actually asked for WPF, we're going to get it. And that's fine. If you ask for it, you get it. If you don't ask for it, you don't get it. That's the new model.</p>
<p>And so we are hopeful that this is a relatively painless way of getting past the problem. Because there are people who were not using Rx because of this: the Avalonia UI project abandoned Rx because their use of Rx meant that their binaries were 60 megabytes bigger than they needed to be, and they felt it was less painful just to stop using Rx than it was to force that on their users.</p>
<p>And so our goal is to say, well, we'd like them to come back. Ideally — maybe they never will. Maybe we've burnt that bridge. But we'd like Rx to be a good choice for anyone building Avalonia UI apps. And so we have to unbundle the UI elements. So that's the big change with Rx 7.</p>
<p>We've gone to great lengths to ensure binary compatibility is maintained, using similar tricks to the ones I just described for the System.Linq.Async stuff as well. It's currently in preview — we've had about 40,000 downloads so far. At some point we're going to have to pull the trigger. No one's told us it doesn't work yet. I would encourage people to try this and see if it works for them, because we suspect no one's told us it's not working because they haven't tried it.</p>
<p>We've done our utmost to test this in every way we can think of, and hopefully it's fine. But sooner or later — this year, not too many months from now — we will go for a proper release of this and then we'll find out whether it's as good as we think it is.</p>
<p>So that's the big thing coming in Rx 7. And we're basically making that the only feature of Rx 7, because we want to separate that change out from everything else. And then further feature work — there'll probably be an Rx version 8 fairly quickly on the heels of Rx 7, where we actually do new feature work.</p>
<p>So I can take a breath for a second.</p>
<p><strong>Frank:</strong> I want to congratulate the team for the clarity of all those error messages, because not everybody makes that effort, and all those messages were very clear and helpful and all those things. So I think this is a great effort. It looks like the team cares.</p>
<p><strong>Ian:</strong> We do really care. We've tried really hard to — oh, congrats on that. It's been controversial because there are a lot of people who would rather we just left it as it is. That's been quite a widespread opinion. But we know there are projects that have walked away from Rx because of this. And so our view is that that's not an acceptable solution.</p>
<p>I'm also just going to address the other thing people say, which is: "Well, can't you just tear it up and start again? Just build a new Rx. Just say it's the end of the line for System.Reactive. Just do System.Reactive.Two or System.MoreReactive or whatever it might be." Super Reactive.</p>
<p>That doesn't work. That absolutely does not work, because if you end up with dependencies on both those libraries — let's say you've decided to use the new Reactive, and then you take a dependency on some component that's using the old Reactive — now you get those ambiguous method errors again. The same problem I showed you with System.Linq.Async. Because you've now got two completely different implementations of Rx, both saying "I provide <code>Where</code> and <code>GroupBy</code> and <code>Select</code> and so on for <code>IObservable</code>." And the compiler doesn't know which one you want, and that's an absolute nightmare.</p>
<p>So the so-called "clean break" solution is no such thing. You basically have to fix this in System.Reactive if you're going to have any hope of proper compatibility going forward.</p>
<p>So there is a huge design document about this. If anyone wants to go into the details on this — let me try and find where that is. So if you go into the Reactive repo and look at ADR — Architectural Design Record — and look at the Package Split ADR, this is my attempt to summarise everything you need to understand in order to solve this problem correctly.</p>
<p>"Summarised."</p>
<p>Yes, it's not as simple as people think. So this explains everything that anyone has suggested to us, because we sought feedback from the community on "what are we going to do about this?" And we've evaluated every option and written the pros and cons about it and explained why it is we've chosen the solution we've eventually gone with. And we also explain exactly what the problem is as well. So you can see — that's how big a self-contained exe is: 90 megabytes. It grows to 182 megabytes if you add Rx. And that's just not acceptable, we don't think.</p>
<p><strong>Katie:</strong> So that's the big deal with Rx 7. That makes sense. I mean, even just looking at that document, I'm like, I believe you, Ian, I believe you fully. We do have a couple of questions that have come in over the last couple of minutes that I want to make sure we get to.</p>
<p><strong>Cam:</strong> Sure. This one from John came in a little bit ago. He says that he's a QA automation person and he's wondering where can we access TFS to log bugs?</p>
<p><strong>Ian:</strong> As in, if you find bugs in the System.Reactive library, the place to log bugs would be — it's a GitHub repo. So just go to github.com/dotnet/reactive/issues, and anyone can report issues. So I'm not sure if that's what he actually means, but we're not on TFS, we're on GitHub. Although we do use Azure DevOps to do our build, because that's what the .NET Foundation originally set us up with for this project. But that is the place to go to report bugs.</p>
<p><strong>Katie:</strong> Perfect. And I think Cam had sent a link to this repo earlier in the stream, so go ahead and search for that. And John, if that wasn't what you were asking, feel free to clarify and we'll ask again.</p>
<p>There is another question from MC Nets: is this Reactive library used by UNO Platform in MVUX, or did they write their own libraries?</p>
<p><strong>Ian:</strong> I don't know about Uno. I mean, it's possible to use this in Uno. One thing I would say though is that we know there are two — at least two — environments where we have some problems. One of which is browser WASM, and the other of which is Unity, the games development environment or the 3D development environment. And both of those are to do with differences in threading in that world.</p>
<p>And that's actually one of the things we're going to work on with Rx 8. We want to address the problems that mean that Rx has some issues in browser WASM and on Unity. And the thing about Uno is that it can end up running in the browser. So I would anticipate that it has the same problems there.</p>
<p>I also know that people have written — there've been various attempts to fork Rx and do new versions of it, because for quite a while it was basically unsupported, and so people were asking for features to be done and nothing happened for a couple of years, and so people went off and did their own forks. Quite understandably.</p>
<p>Our goal is really to try and make it good for all .NET applications, and so that's a big driver of Rx 8 — dealing with these things. I don't actually know specifically with Uno — they may well have done their own libraries. But our goal is to get to the point where they don't have to. And if they wanted to come back to the original Rx, they could. But equally, if they're happy with their new solution, then more power to them.</p>
<p><strong>Katie:</strong> Totally. That makes sense completely. I think that's all the questions I'm seeing. Frank, Cam, anything from you?</p>
<p><strong>Cam:</strong> No, I've got nothing. I just wanted to comment on the work that you guys have done recently to handle the integration with .NET 10 and those various cases of ambiguous references. I am very impressed with that. Ian, I thank you for all your work you've done on that. And I know we've got a lot of comments out in the chat about how useful Reactive has been for them, and I think I saw somebody refer to the entire Reactive team as "goated." So I think we'll take that as a compliment.</p>
<p><strong>Ian:</strong> High praise, high praise. So I should obviously give credit to the people who came before our involvement, because the original team at Microsoft and then the open source group that kind of carried it forwards — they eventually were unable to continue to put the work into it and so it became more or less abandoned for a bit. But it wouldn't be here today if it wasn't for those people. So there have been many, many people before our involvement, without whom it just wouldn't exist at all. So I can't take too much credit. I'm just trying to keep it available for the next generation, because I hugely admire the work that went into it before we got here and want that to continue to be available for everyone, because I think it's really good.</p>
<p><strong>Frank:</strong> Do we have time for a last question? I just see — MC Nets was asking on Twitch: is it only lists, is it only objects that we can observe, or can we look for any change in the class or something like that? How do I trigger a change if some of the properties change?</p>
<p><strong>Ian:</strong> Right. You want to look at a couple of projects out there. So there's a project called ReactiveUI, which is an Rx-based project. It uses Rx deeply to power a user interface-based framework, and that in itself depends on another library. I'm going to get the name wrong, so just go look at ReactiveUI and you'll find it that way.</p>
<p>There's basically a whole model for doing property changes integrated with Rx. So you can say "I've got this Rx stream, I'd like to present it through a property," or you can say "I've got this property that I update, I would like to turn that into an Rx stream." It's called something like Dependent Data, but that's not the right name — but the ReactiveUI library uses it. So that's kind of the best way in for that. So yes, absolutely you can do it.</p>
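<p>As a flavour of what that looks like in practice, here is a minimal sketch using ReactiveUI's <code>ReactiveObject</code> and <code>WhenAnyValue</code> (the view model and property names are made up for illustration):</p>
<pre><code class="language-csharp">using System;
using ReactiveUI;

public class PersonViewModel : ReactiveObject
{
    private string _name = "";

    public string Name
    {
        get =&gt; _name;
        set =&gt; this.RaiseAndSetIfChanged(ref _name, value);   // raises PropertyChanged
    }
}

class Demo
{
    static void Run()
    {
        var vm = new PersonViewModel();

        // Property changes as an Rx stream: emits the current value, then every change.
        vm.WhenAnyValue(x =&gt; x.Name)
          .Subscribe(name =&gt; Console.WriteLine($"Name is now: {name}"));

        vm.Name = "Ada";   // triggers the subscription above
    }
}
</code></pre>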
<p>There's a couple of things I just wanted to quickly talk about if we have a couple more minutes. Do we have time?</p>
<p><strong>Katie:</strong> We've got about five minutes.</p>
<p><strong>Ian:</strong> Okay. So I just wanted to say the other things we're aiming to do for the next version, so people know what we're working on. We want to make sure that we are usable for as many .NET applications as possible, and specifically we want to address WASM and Unity.</p>
<p>We also are going to improve the trimmability support. So Rx 6 did make Rx trimmable, because it used to be: if you added a reference to Rx, that's one megabyte of extra stuff in your executable. In Rx 6, we added basic trimmability, so it could chuck away most of Rx if you weren't using most of it. But we didn't do a complete job — we did just enough to be useful. We're going to do it properly with Rx 8.</p>
<p>The other thing we are going to come back to — we have been pushing this along in the background, but it's been slow for various reasons — is Async Rx. So just as you have <code>IAsyncEnumerable</code>, there is <code>IObservable</code>, but there's also <code>IAsyncObservable</code>. And that is a thing that the original Rx team kind of never really finished. It was always in prototype phase.</p>
<p>We did get a preview version of that out there on NuGet so you can use it today. I just wanted to explain to people why it's still in early preview. And the basic reason is we don't have a complete test suite for it yet.</p>
<p>What we are gradually doing is updating the way the Rx test suite works so we can have a single test suite that works across both regular Rx and Async Rx. And once we've got that, we'll then be happy that we're at production quality for both libraries. We want to test them to the same level and we can't do that yet. So until we're able to test Async Rx to the same extent that we do — because we have thousands of tests for proper Rx — until we can get those applied to everything, we're not happy to say that people should be using it. But we are still working on it. It's been slow because other things have taken priority and we really needed to fix this bloat issue. But that is coming. We're still working on it, for people who were wondering where it's gone.</p>
<p><strong>Katie:</strong> That makes sense. Thank you so much for all those updates, and thank you for being here today, Ian. This was an incredible show. You've gotten a lot of love in the chat and we super appreciate it. And thanks to everybody who is watching. We're here every Monday, same time, same place — On .NET Live. Next week we'll be here with Mattias. Super excited for that one as well. And I hope you have a great rest of your day or evening. Thank you very much.</p>]]></content:encoded>
    </item>
    <item>
      <title>T4 templates on modern .NET</title>
      <description>T4 is a .NET-based templating language. It used to target just .NET Framework. It is now possible to use modern .NET runtimes, but it requires additional work. This post shows how to get it working.</description>
      <link>https://endjin.com/blog/t4-templates-on-modern-dotnet-10</link>
      <guid isPermaLink="true">https://endjin.com/blog/t4-templates-on-modern-dotnet-10</guid>
      <pubDate>Wed, 18 Feb 2026 06:30:00 GMT</pubDate>
      <category>C#</category>
      <category>CSharp</category>
      <category>.NET</category>
      <category>dotnet</category>
      <category>T4</category>
      <category>Templates</category>
      <category>Code generation</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/12/t4-templates-on-modern-dotnet-10.png" />
      <dc:creator>Ian Griffiths</dc:creator>
      <content:encoded><![CDATA[<p>T4 is a popular .NET-based templating language. Originally, it could use only .NET Framework, but in 2023, <a href="https://devblogs.microsoft.com/dotnet/t4-command-line-tool-for-dotnet/">Microsoft added a version of the template tool that could use .NET 6.0</a>. At some later point they added support for .NET 8.0. (As I write this in February 2026, there was not yet support for .NET 10.0.)</p>
<p>However, this modern .NET support is minimal, and is not used by default. The Visual Studio integration continues to use the old .NET Framework implementation. To use the new modern .NET support, you have to run the command line tool manually, or adapt your project files to invoke the tool for you.</p>
<p>The <a href="https://github.com/dotnet/reactive">Rx.NET</a> and <a href="https://github.com/reaqtive/reaqtor/">Reaqtor</a> codebases make extensive use of T4. Up until now we've relied on the built-in Visual Studio support, but the inability to use modern .NET features is starting to become a problem. This post explains what it takes to move projects that use the old .NET Framework T4 support in Visual Studio over to using T4 with modern .NET.</p>
<h2 id="a-quick-introduction-to-t4">A quick introduction to T4</h2>
<p>The documentation seems coy about what the name T4 means, but some say it stands for Text Template Transformation Toolkit. If you've not used T4 before, it's a bit like a Razor page: it can contain a mixture of plain text and C# code. For example:</p>
<pre><code class="language-tt">&lt;#@ template language="C#" #&gt;
This is some plain text that will be emitted verbatim.
&lt;#
  // This code is executed, so it won't appear in the output, but it
  // changes how the output that follows is produced.
  for (int i = 0; i &lt; 5; ++i)
  {
#&gt;
This is also emitted. It's in a loop, so we get many copies.
&lt;#
    // This is another code block.
  }
#&gt;
</code></pre>
<p>If I put that in a file called <code>SimpleTemplate.tt</code> and then run this command:</p>
<pre><code class="language-ps1">TextTransformCore SimpleTemplate.tt
</code></pre>
<p>it produces a file called <code>SimpleTemplate.txt</code> with this content:</p>
<pre><code>This is some plain text that will be emitted verbatim.
This is also emitted. It's in a loop, so we get many copies.
This is also emitted. It's in a loop, so we get many copies.
This is also emitted. It's in a loop, so we get many copies.
This is also emitted. It's in a loop, so we get many copies.
This is also emitted. It's in a loop, so we get many copies.
</code></pre>
<p>I've made my template emit plain text in this example to clarify the fact that T4 is fundamentally text-oriented. You could use it to generate C#, F#, VB.NET, markdown, HTML, Cucumber specs, or, as in this case, just plain text containing natural language.</p>
<p>In Rx.NET and Reaqtor we use T4 to generate repetitive code. For example, the <code>Min</code> and <code>Max</code> operators have multiple versions of what is essentially the same code, just for different numeric types. (Since .NET 7, there has been a better way to solve this particular problem: it introduced new ways of defining interfaces, and the associated <em>generic math</em> feature. However, we still target .NET Framework in Rx.NET, so we can't use that.) We also often use templates driven by reflection to generate code whose structure is determined by other code.</p>
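<p>For anyone curious about the generic math alternative mentioned above, it looks something like this (a sketch of the technique, not code from Rx.NET itself):</p>
<pre><code class="language-csharp">using System;
using System.Collections.Generic;
using System.Numerics;

static class GenericMath
{
    // One generic implementation instead of a generated overload per numeric type.
    // INumber&lt;T&gt; (.NET 7+) supplies the comparison operators used below.
    public static T Min&lt;T&gt;(IEnumerable&lt;T&gt; source) where T : INumber&lt;T&gt;
    {
        using IEnumerator&lt;T&gt; e = source.GetEnumerator();
        if (!e.MoveNext()) throw new InvalidOperationException("Sequence contains no elements");

        T min = e.Current;
        while (e.MoveNext())
        {
            if (e.Current &lt; min) min = e.Current;
        }
        return min;
    }
}
</code></pre>
<p>The same <code>GenericMath.Min</code> then works for <code>int</code>, <code>double</code>, <code>decimal</code>, and any other type implementing <code>INumber&lt;T&gt;</code>.</p>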
<h3 id="arent-we-supposed-to-be-using-source-generators-now">Aren't we supposed to be using source generators now?</h3>
<p>In theory the introduction of <a href="https://devblogs.microsoft.com/dotnet/introducing-c-source-generators/">source generators</a> renders T4 unnecessary for the ways we use it in Rx.NET and Reaqtor. Now there is direct support in the .NET SDK for generating code at build time.</p>
<p>However, having written a couple of source generators I find them to be a major step up in complexity from T4.</p>
<p>They enable developers to create really useful tools. For example, our <a href="https://github.com/corvus-dotnet/Corvus.JsonSchema">Corvus.JsonSchema</a> libraries offer <a href="https://www.nuget.org/packages/Corvus.Json.SourceGenerator">Corvus.Json.SourceGenerator</a>, which is now my go-to solution when I want to deal with JSON in C#. But while source generators can be great to use, they are a bit of a nightmare to write. So I think there is still a place for T4.</p>
<h2 id="tooling-changes">Tooling changes</h2>
<p>To understand how to migrate an existing project to using modern .NET with T4, it's important to understand the differences in tooling support between T4 on .NET FX and T4 on .NET.</p>
<h3 id="visual-studios-existing-support-for.net-fx">Visual Studio's existing support for .NET FX</h3>
<p>Visual Studio has offered support for T4 for many years. You enabled it by adding this to your project file:</p>
<pre><code class="language-xml">&lt;ItemGroup&gt;
    &lt;Service Include="{508349b6-6b84-4df5-91f0-309beebad82d}" /&gt;
&lt;/ItemGroup&gt;
</code></pre>
<p>You could then tell Visual Studio that certain source files were T4 templates, and normally you would also tell it about the association between the T4 template and its generated output, e.g.:</p>
<pre><code class="language-xml">  &lt;ItemGroup&gt;
    &lt;None
        Update="Example.tt"
        Generator="TextTemplatingFileGenerator"
        LastGenOutput="Example.cs"
        /&gt;

    &lt;Compile
        Update="Example.cs"
        DesignTime="True"
        AutoGen="True"
        DependentUpon="Example.tt"
        /&gt;
  &lt;/ItemGroup&gt;
</code></pre>
<p>The <code>&lt;None&gt;</code> element here sets the <code>Generator</code> attribute to <code>TextTemplatingFileGenerator</code>, and this makes Visual Studio offer a couple of additional options on the file's context menu in Solution Explorer:</p>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/12/t4-vs-context-menu.png" alt="Visual Studio T4 context menu showing the Run Custom Tool and Debug T4 Template options" srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2025/12/t4-vs-context-menu.png 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2025/12/t4-vs-context-menu.png 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2025/12/t4-vs-context-menu.png 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2025/12/t4-vs-context-menu.png 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<p>Selecting <strong>Run Custom Tool</strong> causes the T4 template to execute, generating its output. The <strong>Debug T4 Template</strong> option runs it under the debugger so you can step through the template code.</p>
<h3 id="t4-on.net">T4 on .NET</h3>
<p>The more recently added support for T4 on .NET provides one thing: the <code>TextTransformCore</code> command line tool. There is no Visual Studio integration. There is no supported way to tell Visual Studio to execute a template using .NET; Visual Studio (today) only offers the old .NET Framework-based T4 execution that it has had for years.</p>
<p>So the new .NET support is all very bare bones. We get almost nothing compared to the support available when running a T4 template on .NET Framework. The old context menu items are still available; it's just that they can only invoke the old .NET Framework T4 tooling.</p>
<h4 id="changes-required-when-using-migrating-t4-from.net-framework-to.net">Changes required when using migrating T4 from .NET Framework to .NET</h4>
<p>Note that if you are using <code>assembly</code> directives in your template, you might need to change them, because some .NET runtime library types are in different assemblies. For example, if a template written to run on .NET FX includes this line:</p>
<pre><code class="language-tt">&lt;#@ assembly name="System.Core" #&gt;
</code></pre>
<p>you will probably need to change it to this to get it working on .NET:</p>
<pre><code class="language-tt">&lt;#@ assembly name="System.Linq" #&gt;
</code></pre>
<p>You might also find that you are getting errors such as this:</p>
<p><code>Compiling transformation: CS1069: The type name 'Stack&lt;&gt;' could not be found in the namespace 'System.Collections.Generic'. This type has been forwarded to assembly 'System.Collections, Version=0.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a' Consider adding a reference to that assembly.</code></p>
<p>You may need to add this:</p>
<pre><code class="language-tt">&lt;#@ assembly name="System.Collections" #&gt;
</code></pre>
<p>A more subtle problem is that the T4 tooling does not understand the distinction between reference assemblies and runtime assemblies. It always uses the latter, which can cause some surprises. For example, you might get an error of this form when trying to use the types in <code>System.Xml.Linq</code>:</p>
<p><code>error CS1069: Compiling transformation: CS1069: The type name 'XElement' could not be found in the namespace 'System.Xml.Linq'. This type has been forwarded to assembly 'System.Private.Xml.Linq, Version=8.0.0.0, Culture=neutral, PublicKeyToken=cc7b13ffcd2ddd51' Consider adding a reference to that assembly.</code></p>
<p>You can resolve this by adding another assembly directive:</p>
<pre><code class="language-tt">&lt;#@ assembly name="System.Private.Xml.Linq" #&gt;
</code></pre>
<p>but this is somewhat unsatisfactory: the fact that .NET 8.0 happens to put this type in this assembly is an implementation detail that could easily change from one version of .NET to the next. But for now this seems to be the only way to work around this. I've submitted a bug report at <a href="https://developercommunity.visualstudio.com/t/TextTransformCore-uses-runtime-not-ref/11013312">https://developercommunity.visualstudio.com/t/TextTransformCore-uses-runtime-not-ref/11013312</a>, so if you're having the same problem you can add your support for getting this fixed there.</p>
<h2 id="better-project-support-for-t4-on.net">Better project support for T4 on .NET</h2>
<p>Although there is no built-in tooling, it's actually relatively straightforward to make an existing project use the newer tooling, once you know how. We can do this with some modifications to project files. The basic process is:</p>
<ul>
<li>Define an <code>ItemGroup</code> for all your T4 templates</li>
<li>Automatically set the <code>DependentUpon</code> item metadata on the generated code (to ensure generated files go underneath their T4 files)</li>
<li>Define a custom <code>Target</code> that runs the T4 templates if the templates are newer than the generated outputs</li>
</ul>
<h3 id="defining-an-item-group-for-templates">Defining an item group for templates</h3>
<p>I put this in a <code>Directory.Build.props</code> file at the root of my solution, so that <code>.tt</code> files anywhere in any project in my solution are added to the item group:</p>
<pre><code class="language-xml">&lt;ItemGroup&gt;
  &lt;TextTemplates Include="**\*.tt"&gt;
    &lt;GeneratedOutput&gt;%(Filename).cs&lt;/GeneratedOutput&gt;
    &lt;GeneratedOutputRelativePath&gt;%(RelativeDir)%(GeneratedOutput)&lt;/GeneratedOutputRelativePath&gt;
  &lt;/TextTemplates&gt;
&lt;/ItemGroup&gt;
</code></pre>
<p>The <code>Include="**\*.tt"</code> is a glob that adds all files with a <code>.tt</code> extension anywhere in any project to the <code>TextTemplates</code> item group.</p>
<p>We then set two item metadata values:</p>
<ul>
<li><code>GeneratedOutput</code>: the filename of the output that the template will generate</li>
<li><code>GeneratedOutputRelativePath</code>: the path of the template output relative to the project folder</li>
</ul>
<p>In fact, in the Reaqtor codebase, we do something slightly more complex:</p>
<pre><code class="language-xml">&lt;ItemGroup&gt;
  &lt;TextTemplates Include="**\*.tt"&gt;
    &lt;GeneratedOutput Condition="Exists('%(RootDir)%(Directory)%(Filename).generated.cs')"&gt;%(Filename).generated.cs&lt;/GeneratedOutput&gt;
    &lt;GeneratedOutput Condition="%(GeneratedOutput) == ''"&gt;%(Filename).cs&lt;/GeneratedOutput&gt;
    &lt;GeneratedOutputRelativePath&gt;%(RelativeDir)%(GeneratedOutput)&lt;/GeneratedOutputRelativePath&gt;
  &lt;/TextTemplates&gt;
&lt;/ItemGroup&gt;
</code></pre>
<p>The reason for this is that historically the Reaqtor codebase has used two different conventions. In some cases, a template called, say, <code>ByteArray.tt</code> generates a file with the same name but a <code>.cs</code> extension, e.g. <code>ByteArray.cs</code>. However, in some places the T4 includes this directive:</p>
<pre><code class="language-tt">&lt;#@ output extension=".generated.cs" #&gt;
</code></pre>
<p>For example, that appears in the <code>LetOptimizerTests.tt</code> template, and the effect is that the generated file is called <code>LetOptimizerTests.generated.cs</code>. (In this case, that's because the generated code is adding extra methods to a partial class, so there's already a non-generated <code>LetOptimizerTests.cs</code> file. The generated code needs to go into a file with a different name.) Just to confuse matters further, some templates have <code>Generated</code> in their name, e.g. <code>PooledObjects.Generated.tt</code>. Obviously in this case we <em>don't</em> want the generated file to be <code>PooledObjects.Generated.generated.cs</code>, so this one is really an example of the first convention in which the <code>.tt</code> becomes <code>.cs</code> in the generated output.</p>
<p>The more complex XML shown above takes this into account: it looks to see if a file with that <code>.generated.cs</code> extension exists, and if so, selects that filename as the target for the template. But if it's not present, it just picks the other name.</p>
<p>Note that it's actually the template itself that determines what the output file name is with that <code>output extension</code> directive. This project file content just looks at what files exist, and infers from that which convention was used.</p>
<h3 id="correct-solution-explorer-behaviour-with-dependentupon">Correct Solution Explorer behaviour with DependentUpon</h3>
<p>To ensure that the source file that a template generates appears nested inside that template in Solution Explorer, I put this in the <code>Directory.Build.targets</code> file:</p>
<pre><code class="language-xml">&lt;ItemGroup&gt;
  &lt;Compile Update="@(TextTemplates-&gt;'%(GeneratedOutputRelativePath)')"&gt;
    &lt;DesignTime&gt;true&lt;/DesignTime&gt;
    &lt;AutoGen&gt;true&lt;/AutoGen&gt;
    &lt;DependentUpon&gt;%(TextTemplates.Filename).tt&lt;/DependentUpon&gt;
  &lt;/Compile&gt;    
&lt;/ItemGroup&gt;
</code></pre>
<p>We put this in the <code>Directory.Build.targets</code> file so that it can run after everything in the .NET SDK's various <code>.props</code> files. Those will set up the <code>Compile</code> item group, which we just want to update. If we put this in the <code>Directory.Build.props</code> file, the <code>Compile</code> item group wouldn't exist yet so there would be nothing for us to <code>Update</code>.</p>
<p>With this in place, you can now remove all entries of this form from your project files:</p>
<pre><code class="language-xml">&lt;ItemGroup&gt;
  &lt;None
      Update="Example.tt"
      Generator="TextTemplatingFileGenerator"
      LastGenOutput="Example.cs"
      /&gt;
  &lt;Compile
      Update="Example.cs"
      DesignTime="True"
      AutoGen="True"
      DependentUpon="Example.tt"
      /&gt;
&lt;/ItemGroup&gt;
</code></pre>
<p>These are no longer necessary because the preceding <code>ItemGroup</code> automatically sets the item group metadata correctly for <em>all</em> templates.</p>
<h3 id="custom-target-to-execute-templates">Custom target to execute templates</h3>
<p>Finally, also in the <code>Directory.Build.targets</code> we define this custom target:</p>
<pre><code class="language-xml">&lt;Target
  Name="_TransformTextTemplates"
  BeforeTargets="PreBuildEvent"
  Condition="@(TextTemplates) != '' and $(DevEnvDir) != ''"
  Inputs="@(TextTemplates)"
  Outputs="@(TextTemplates-&gt;'%(GeneratedOutputRelativePath)')"&gt;

  &lt;Exec
    WorkingDirectory="$(ProjectDir)"
    Command='"$(DevEnvDir)TextTransformCore.exe" "%(TextTemplates.Identity)"' /&gt;

&lt;/Target&gt;
</code></pre>
<p>We've set this to execute before the <code>PreBuildEvent</code>, meaning that all T4 generation occurs before the main build work happens.</p>
<p>The <code>Condition</code> here ensures that this target only attempts to run when the build is in a Visual Studio environment. (Either we are using Visual Studio to run the build, or the build was run from a Visual Studio developer prompt.) This is necessary because the <code>TextTransformCore</code> tool is not part of the .NET SDK, so it's not universally available. It's part of Visual Studio.</p>
<p>Generated source files will be checked into source control, so we only ever need to run the T4 tool if the template changes. So in cases where someone just clones a repository and builds it, it won't matter if they don't have Visual Studio available because all the generated files will be present anyway. (But anyone wishing to modify a template, and to get the corresponding modified output, will need Visual Studio, because that's the only official way to get the <code>TextTransformCore</code> tool today.)</p>
<p>This target uses the <code>Inputs</code> and <code>Outputs</code> to ensure that it only runs templates when the <code>.tt</code> file's timestamp is newer than the generated source file. (This conditional timestamp-based execution is built into MSBuild. You just have to tell it how a target's inputs and outputs are related.)</p>
<h2 id="conclusions">Conclusions</h2>
<p>We are no longer constrained to using .NET Framework inside T4 files. Although this means abandoning the built-in Visual Studio tooling, some relatively simple project file modifications make it possible to run your T4 templates on a modern .NET runtime in a straightforward way. We do lose the ability to debug the templates, but we gain automated re-execution of the templates as part of the build.</p>]]></content:encoded>
    </item>
    <item>
      <title>What is Retrieval-Augmented Generation (RAG)?</title>
      <description>What is RAG? Learn how RAG combines retrieval, augmentation &amp; generation to ground GenAI responses in your data while reducing hallucinations &amp; improving accuracy.</description>
      <link>https://endjin.com/blog/what-is-retrieval-augmented-generation-rag</link>
      <guid isPermaLink="true">https://endjin.com/blog/what-is-retrieval-augmented-generation-rag</guid>
      <pubDate>Thu, 05 Feb 2026 07:30:00 GMT</pubDate>
      <category>AI</category>
      <category>GenAI</category>
      <category>RAG</category>
      <category>Azure Foundry</category>
      <category>Azure AI Search</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/01/what-is-retrieval-augmented-generation-rag.png" />
      <dc:creator>Carmel Eve</dc:creator>
      <content:encoded><![CDATA[<p>There has been a massive explosion in the use of generative AI, and when and <em>how</em> we use it has become an incredibly important question. It is easy enough to throw anything into an LLM, and a lot of those attempts will be met with mixed success.</p>
<p>One thing that LLMs are <em>provably</em> good for is the summarisation / re-structuring of text. But how do we focus the model on the information we care about - and stop its tendency to add additional, plausible, but possibly irrelevant (or incorrect!) information? And how do we limit the information it needs to parse, so that we don’t hit token limits or suffer the performance degradation that comes with huge inputs? This is where RAG comes in.</p>
<h2 id="what-is-rag">What is RAG?</h2>
<p>Retrieval-Augmented Generation (RAG) is a multi-step process by which we retrieve relevant information, and then add that information to the context, along with the given prompt. It allows us to ground responses in our data, rather than relying solely on pre-trained knowledge.</p>
<p>The RAG process is as follows:</p>
<ul>
<li><strong>Retrieval</strong>: Retrieve relevant information. This can be from databases, documents, knowledge bases, etc. Often semantic / vector-based search is used to find information relevant to the input (more on this later).</li>
<li><strong>Augmentation</strong>: The retrieved information is then added into the context, augmenting the prompt with the relevant data.</li>
<li><strong>Generation</strong>: The language model generates a response, based on the augmented prompt.</li>
</ul>
<p>This approach has a few advantages:</p>
<ul>
<li>Information is up to date as it can be retrieved from a live source, rather than relying on what the model was trained on.</li>
<li>You can ground responses in your domain-specific knowledge, without needing a specialised model.</li>
<li>You can add references to specific documents / pieces of information, allowing you to cross-check the response.</li>
<li>By enforcing direct links to the data, you can reduce the chance of hallucinations (though with large caveats - the subject of a future post).</li>
<li>You do not need to release sensitive information into model training processes.</li>
<li>You can control access just as you would for any database, and only ever add information that a user is allowed to see into the context. This allows for a fine-grained security model that would be impossible if the LLM were trained on all of the data (in that case, there is no way to limit what data a user can see once they are given access to the LLM).</li>
<li>Reducing the amount of information that the model needs to process (by limiting the context to the most relevant information) means that you can usually use smaller models and still achieve great results.</li>
</ul>
<p>Let's take an example of a retail website where customers can leave reviews. You might want to build an application that allows users to ask questions about the reviews that customers have left.</p>
<p>In a RAG example, the review data could be queried (by standard query, or vector search), and the reviews that are relevant would be <strong>retrieved</strong>. These reviews would then be used to <strong>augment</strong> the user's question, and added to the context that is passed into an LLM. The LLM would then use that context to <strong>generate</strong> an answer to the user's question.</p>
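<p>To make these three steps concrete, here is a minimal Python sketch of the flow for the review example. The function names (<code>retrieve_reviews</code>, <code>build_prompt</code>) and the <code>llm_complete</code> callable are illustrative placeholders rather than any specific product API; the retrieval step is stubbed out where a real system would query a database, search index or vector store.</p>
<pre><code class="language-python">def retrieve_reviews(question: str, top_k: int = 5) -&gt; list[str]:
    """Retrieval: find the reviews most relevant to the question.
    A real system would query a database, keyword index or vector store here."""
    all_reviews = [
        "Shipping took over 3 weeks. No updates on tracking.",
        "Super fast shipping! Ordered on Monday and it arrived Wednesday.",
        "Material feels cheap and flimsy.",
    ]
    return all_reviews[:top_k]  # placeholder: a real implementation would rank by relevance


def build_prompt(question: str, reviews: list[str]) -&gt; str:
    """Augmentation: add the retrieved reviews to the prompt as context."""
    context = "\n\n".join(f"Review: {r}" for r in reviews)
    return (
        f"Answer the following question: '{question}', based only on the "
        f"customer reviews in the context below.\n\nContext:\n{context}"
    )


def answer(question: str, llm_complete) -&gt; str:
    """Generation: the LLM produces a response from the augmented prompt."""
    prompt = build_prompt(question, retrieve_reviews(question))
    return llm_complete(prompt)  # llm_complete wraps whatever model client you are using
</code></pre>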
<h2 id="retrieval">Retrieval</h2>
<p>At the heart of RAG is the ability to retrieve content that is relevant.</p>
<h3 id="database-query">Database Query</h3>
<p>This could take the form of a standard database query based on criteria. In our review example, imagine that these reviews are stored in a database. The user could filter the reviews using certain criteria - "Clothing Product", "Fewer than 2 stars", "From the Last 24 Hours", etc.</p>
<p>These criteria could be used to query the database, and all relevant information returned. This information would then be added as context to any questions that the user wanted to ask.</p>
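<p>As a sketch, assuming the reviews live in a relational table called <code>reviews</code> (the table and column names here are purely illustrative), the criteria-based retrieval step might look like this:</p>
<pre><code class="language-python">import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect("reviews.db")

# Criteria chosen by the user: clothing products, fewer than 2 stars, last 24 hours
cutoff = (datetime.now() - timedelta(hours=24)).isoformat()

relevant_reviews = conn.execute(
    """
    SELECT rating, review_text
    FROM reviews
    WHERE product_category = ? AND rating &lt; ? AND created_at &gt;= ?
    """,
    ("Clothing", 2, cutoff),
).fetchall()

# relevant_reviews now forms the context used to augment the prompt
</code></pre>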
<h3 id="keyword-search">Keyword Search</h3>
<p>Another option is to filter the responses based on keywords. In our example, the user could input keywords ("late", "expensive", "broken", etc.). In this way, you can filter the results down to those which directly talk about relevant topics.</p>
<p>There are services, such as Azure AI Search, which allow you to do fast keyword matching on documents. Azure AI Search also allows you to do "fuzzy" matching - which handles differences in capitalisation, spelling mistakes, etc.</p>
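<p>As a rough sketch using the <code>azure-search-documents</code> Python SDK (the endpoint, index name, key and field name below are placeholders, and the fuzzy <code>~</code> operator requires the full Lucene query syntax):</p>
<pre><code class="language-python">from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

client = SearchClient(
    endpoint="https://&lt;your-service&gt;.search.windows.net",
    index_name="reviews",
    credential=AzureKeyCredential("&lt;your-query-key&gt;"),
)

# Fuzzy keyword search: the '~' suffix tolerates small spelling differences
results = client.search(search_text="late~ OR broken~", query_type="full", top=10)

retrieved_reviews = [doc["review_text"] for doc in results]  # assumes a 'review_text' field
</code></pre>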
<p>Again, the relevant documents would be added to the context used to augment the user's questions.</p>
<h3 id="vector-search">Vector Search</h3>
<p>In vector search, embeddings are used to find documents which are conceptually related to the search terms or questions asked.</p>
<p>An embedding is a numerical representation of text - essentially converting words and sentences into arrays of numbers (vectors) that capture their semantic meaning. These vectors contain many dimensions, allowing for complex concepts to be represented. Embedding models are typically neural networks trained on massive amounts of text to learn these relationships.</p>
<p>The secret to vector search is that, using these embeddings, semantically similar text ends up with vectors that are close together in this high-dimensional space. For example, if we vectorised the words "snail" and "slug", we might end up with two vectors that point in a similar direction.</p>
<p>Though similar, these two words don't mean <em>quite</em> the same thing, and this is captured in vector space:</p>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/12/vector-space.png" alt="An set of x,y,z axes, showing arrows for the vectors for snail and slug, and an arrow between them which is labelled &quot;shell&quot;." srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2025/12/vector-space.png 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2025/12/vector-space.png 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2025/12/vector-space.png 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2025/12/vector-space.png 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<p>Here we can see that the vectors for snail and slug are pointing in relatively similar directions, and the difference between the two is about equal to the vector for "shell".</p>
<p>This is a super simplified example, and you would obviously need a lot more dimensions to represent all of the complex information that makes up a "slug" or "snail". But using this, we can start to visualise how different information and the connections between it can be represented in vector space.</p>
<p>An important thing to understand here is that AI embeddings turn words into vectors <em>in context</em>. For example, the word "bow" in the sentence "she had a <strong>bow</strong> in her hair", would have a very different vector to "the actors took a <strong>bow</strong>". Embedding models don't really embed the meaning of just a word, but instead the meaning of a sentence or context around a word is represented in vector space.</p>
<p>In practice, in RAG scenarios, generally it is not a single word that is vectorised, but a whole sentence or block of text. The overall meaning of the text is then used to retrieve relevant results.</p>
<p>Going back to our retail example, when you embed the text "Shipping took forever" and "Delivery was extremely slow", they'll have similar vector representations despite different words, because they mean similar things. Meanwhile, "Material feels cheap and flimsy" will be far away in vector space.</p>
<p>The user might ask a question such as "What are customers saying about delivery?". Using vector search, the user's question (or prompt) can be vectorised, and used to find reviews that are relevant to what they're asking about.</p>
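<p>A toy sketch of the idea (the four-dimensional vectors below are made up purely for illustration - real embedding models produce vectors with hundreds or thousands of dimensions):</p>
<pre><code class="language-python">import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -&gt; float:
    """1.0 means the vectors point the same way; values near 0 mean unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings (illustrative numbers only)
query    = np.array([0.12, 0.83, 0.05, 0.31])  # "What are customers saying about delivery?"
review_a = np.array([0.10, 0.80, 0.07, 0.35])  # "Shipping took forever"
review_b = np.array([0.90, 0.05, 0.60, 0.02])  # "Material feels cheap and flimsy"

print(cosine_similarity(query, review_a))  # high score - retrieve this review
print(cosine_similarity(query, review_b))  # low score - leave this one out
</code></pre>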
<h2 id="augmentation">Augmentation</h2>
<p>Once relevant documents (and in the case of our example, reviews) have been retrieved, they are added to the context for the LLM. This usually involves literally adding the relevant documents into the prompt that is sent to the LLM.</p>
<p>For example, if a user asks "What are customers saying about delivery?", relevant reviews could be retrieved (using one of the methods above).</p>
<p>Then, the prompt sent into the LLM would include the user's question, and all of the relevant documents. E.g.:</p>
<div class="aside"><p>Answer the following question: 'What are customers saying about delivery?', based on the customer reviews in the context provided. Do not use any information outside of what is contained in the given context.</p>
<h3 id="context">Context:</h3>
<p><strong>Review 1</strong></p>
<ul>
<li>Rating: 5</li>
<li>Review Content: Amazing service! My order arrived in just 2 days, even though I only selected standard shipping. Very impressed.</li>
</ul>
<p><strong>Review 2</strong></p>
<ul>
<li>Rating: 1</li>
<li>Review Content: Shipping took over 3 weeks. No updates on tracking. Had to contact support multiple times. Completely unacceptable.</li>
</ul>
<p><strong>Review 3</strong></p>
<ul>
<li>Rating: 4</li>
<li>Review Content: Package arrived on time and in perfect condition. Tracking updates were accurate throughout the entire process.</li>
</ul>
<p><strong>Review 4</strong></p>
<ul>
<li>Rating: 3</li>
<li>Review Content: Delivery was supposed to be 5-7 days but took 10. Not terrible but not what was promised either.</li>
</ul>
<p><strong>Review 5</strong></p>
<ul>
<li>Rating: 1</li>
<li>Review Content: My package was marked as delivered but I never received it. Driver must have left it at the wrong address. Still waiting for resolution.</li>
</ul>
<p><strong>Review 6</strong></p>
<ul>
<li>Rating: 5</li>
<li>Review Content: Super fast shipping! Ordered on Monday and it was at my door by Wednesday morning. Packaging was secure too.</li>
</ul>
<p><strong>Review 7</strong></p>
<ul>
<li>Rating: 2</li>
<li>Review Content: Delivery took forever and the box was crushed when it finally arrived. Thankfully the product inside wasn't damaged.</li>
</ul>
</div>
<p>It is worth noting that the retrieval of relevant information, rather than just augmenting the prompt with <em>all</em> review data, is an important step. This is because as prompt input sizes increase, the responses from the LLM degrade. Limiting the context to a smaller subset of relevant information is the best way to get useful responses.</p>
<h2 id="generation">Generation</h2>
<p>The prompt will then be passed into an LLM. The LLM does not need to have been trained on any of the review data, as all of the relevant data is provided as part of the prompt.</p>
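<p>As one concrete (but purely illustrative) example of this step, here is what the call might look like with the OpenAI Python SDK - any chat-completion model and client would do, and the model name is an assumption:</p>
<pre><code class="language-python">from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# augmented_prompt is the prompt assembled in the augmentation step above (truncated here)
augmented_prompt = "Answer the following question: 'What are customers saying about delivery?'..."

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        {"role": "system", "content": "Only use the information in the provided context."},
        {"role": "user", "content": augmented_prompt},
    ],
)
print(response.choices[0].message.content)
</code></pre>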
<p>The LLM will analyse the retrieved reviews and generate a natural language response. For our example question "What are customers saying about delivery?", the LLM might respond:</p>
<div class="aside"><p>"Customer feedback on delivery is mixed. Positive reviews highlight fast shipping times, with some customers receiving orders in 2-3 days even with standard shipping. The tracking system is generally praised for being accurate. However, there are concerns about inconsistent delivery times - some orders took significantly longer than promised (10+ days instead of 5-7). There are also reports of damaged packages and delivery issues like packages being marked as delivered but not received."</p>
</div>
<p>This response is grounded entirely in the context provided - the LLM hasn't hallucinated information or drawn on training data that might be outdated or irrelevant to this specific business.</p>
<p>An important thing to note is that though using RAG will increase the likelihood of responses being grounded and relevant (and adding sentences to the prompt such as "Do not use any information outside of what is contained in the given context." will further increase this likelihood), there is still an inherent propensity for LLMs to fall back on training data, or include unexpected information in responses. As such, even in RAG situations, all responses <em>must</em> be validated against expected outputs.</p>
<h2 id="summary">Summary</h2>
<p>RAG provides a powerful way to make LLMs more useful for real-world applications. By retrieving relevant information, augmenting prompts with that context, and then generating responses, we can:</p>
<ul>
<li>Keep information current and accurate</li>
<li>Ground responses in domain-specific data</li>
<li>Reduce hallucinations (with caveats)</li>
<li>Maintain security and access controls</li>
<li>Use smaller, more efficient models</li>
</ul>
<p>Thanks for reading this introduction to RAG. Watch out for my next blog, which will dive deeper into implementation!</p>]]></content:encoded>
    </item>
    <item>
      <title>Polars Workloads on Microsoft Fabric</title>
      <description>Polars now ships inside Microsoft Fabric by default. Here's how to use it alongside Fabric's other analytics tools and what that means for your data workflows.</description>
      <link>https://endjin.com/blog/polars-workloads-on-microsoft-fabric</link>
      <guid isPermaLink="true">https://endjin.com/blog/polars-workloads-on-microsoft-fabric</guid>
      <pubDate>Thu, 29 Jan 2026 05:34:00 GMT</pubDate>
      <category>Polars</category>
      <category>Microsoft Fabric</category>
      <category>Deltalake</category>
      <category>DataFrame</category>
      <category>Cloud Computing</category>
      <category>Data Processing</category>
      <category>Python</category>
      <category>Microsoft Fabric</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/01/polars-workloads-on-fabric.png" />
      <dc:creator>Barry Smart</dc:creator>
      <content:encoded><![CDATA[<p>TL;DR: Run fast, cost-effective analytics on Microsoft Fabric without Spark clusters by using Polars. This guide covers reading from OneLake, transforming data with lazy evaluation, writing to Delta tables, and seamlessly switching between local and Fabric environments.</p>
<h2 id="overview">Overview</h2>
<p>Microsoft Fabric's Python Notebooks provide an ideal environment for running Polars-based analytics workloads. With Polars pre-installed and native access to OneLake, you can build fast, memory-efficient data pipelines without the overhead of Spark. This post walks through the practicalities: reading raw files, transforming data, and writing to Delta tables in your Lakehouse.</p>
<p>Key points:</p>
<ul>
<li><strong>Reading files</strong>: Use relative paths (<code>/lakehouse/default/Files/...</code>) for the pinned lakehouse, or ABFS paths for cross-workspace access</li>
<li><strong>Reading Delta</strong>: <code>pl.read_delta()</code> or <code>pl.scan_delta()</code> for lazy evaluation</li>
<li><strong>Writing Delta</strong>: <code>df.write_delta()</code> works out of the box; use <code>dt.replace_time_zone("UTC")</code> on timestamps to avoid SQL endpoint errors</li>
<li><strong>Storage options</strong>: Only needed for cross-lakehouse access—pass <code>{"bearer_token": notebookutils.credentials.getToken('storage'), "use_fabric_endpoint": "true"}</code></li>
<li><strong>Performance</strong>: Use <code>scan_*</code> methods for large files, specify columns upfront, and consider tuning rowgroups (8M+ rows) for DirectLake consumption</li>
<li><strong>Limitations</strong>: No V-ORDER or Liquid Clustering without Spark; max 64 vCores on single node</li>
</ul>
<h2 id="why-polars-on-fabric">Why Polars on Fabric?</h2>
<p>Fabric's new Python Notebooks run on a lightweight single-node container (2 vCores, 16GB RAM by default) rather than a Spark cluster. This is a better fit for many workloads:</p>
<ul>
<li><strong>Speed without complexity</strong>: Polars' Rust-based engine delivers Spark-comparable performance on datasets that fit in memory, without the cluster coordination overhead.</li>
<li><strong>Cost efficiency</strong>: No Spark cluster spin-up means lower CU consumption for smaller jobs.</li>
<li><strong>Rapid iteration</strong>: Sub-second notebook startup times versus minutes for Spark.</li>
<li><strong>Seamless integration</strong>: Polars is pre-installed and OneLake paths work out of the box.</li>
</ul>
<p>Microsoft explicitly recommends Polars (alongside DuckDB) as an alternative to pandas when you encounter memory pressure—a tacit acknowledgement that single-node, in-process tools have earned their place in the enterprise data stack.</p>
<h2 id="writing-code-that-can-run-both-locally-and-on-fabric">Writing code that can run both locally and on Fabric</h2>
<p>One of the major benefits that we find in using Polars is that we can develop locally (using local compute and local storage) and then deploy onto Fabric for fully hosted, production scale, automated operations.</p>
<p>This gives us the best of both worlds: a developer experience that feels like mainstream software engineering (fast inner dev loop with local unit tests which run in seconds), and the ability to deploy onto a cloud platform for orchestration and integration into the wider enterprise data pipeline ecosystem.</p>
<p>But in order to do this, we need to set up a simple helper function to detect where the code is running and set up the connections accordingly.</p>
<p>We tackle this in a few stages.</p>
<p>Firstly we need to detect if the code is running in a <strong>Fabric Python Notebook</strong>, we can determine that by checking for specific environment variables as follows:</p>
<pre><code class="language-python">import os

def is_fabric_python_notebook() -&gt; bool:
    """Detect specifically a Python (non-Spark) notebook."""
    return (
        'JUPYTER_SERVER_HOME' in os.environ 
        and 'SPARK_HOME' not in os.environ
    )
    
logger.info(f"Is this running in a Fabric Python Notebook?: {is_fabric_python_notebook()}")
</code></pre>
<pre><code class="language-plaintext">INFO: Is this running in a Fabric Python Notebook?: False
</code></pre>
<p>Next we need to be able to construct an <code>abfss</code> (Azure Blob File System Secure) path to files / folders that we want to be able to read from or write to using Polars.</p>
<p>The format of an <code>abfss</code> path adopts the following convention on Fabric:</p>
<p><code>abfss://{ws}@onelake.dfs.fabric.microsoft.com/{lh}.Lakehouse</code></p>
<p>Where <code>{ws}</code> is replaced by the Fabric workspace name and <code>{lh}</code> is replaced by the lakehouse name.</p>
<p>Furthermore, Fabric lakehouses are organised into two discrete areas:</p>
<ul>
<li><strong>Files</strong> - an area which is used to hold raw or unstructured content.  It has an <code>abfss</code> path:
<code>abfss://{ws}@onelake.dfs.fabric.microsoft.com/{lh}.Lakehouse/Files/{relative_path}</code></li>
<li><strong>Tables</strong> - an area which holds tabular data (most commonly in Delta format).  It uses an <code>abfss</code> path convention of:
<code>abfss://{ws}@onelake.dfs.fabric.microsoft.com/{lh}.Lakehouse/Tables/{schema_name}/{table_name}</code></li>
</ul>
<p>There is also an option to "pin" a default lakehouse to a Fabric notebook and reference that in a shorthand path as follows:</p>
<ul>
<li>Files: <code>/lakehouse/default/Files/{relative_path}</code></li>
<li>Tables: <code>/lakehouse/default/Tables/{schema_name}/{table_name}</code></li>
</ul>
<p>It can be convenient to use this pinned default lakehouse for exploratory data analysis in notebooks.  However, for software that is destined to end up in production, we recommend using the full <code>abfss</code> path to explicitly reference the lakehouse.</p>
<p>Furthermore, you can only pin one lakehouse to a notebook, which doesn't work well with the common pattern we see where a notebook reads from one lakehouse (e.g. "Bronze"), wrangles the data and writes out to another lakehouse (e.g. "Silver").  For this reason, it makes sense to declare the full abfss path for both the source and target lakehouses, just to keep things consistent.</p>
<p>So it is often useful to use a Python helper function to construct the <code>abfss</code> path from component parts:</p>
<pre><code class="language-python">def construct_base_abfss_path(workspace_name: str, lakehouse_name: str) -&gt; str:
    """Construct the base ABFSS path for a given workspace and lakehouse."""
    # Because it is a URL, replace spaces with %20
    workspace_name = workspace_name.replace(" ", "%20")
    lakehouse_name = lakehouse_name.replace(" ", "%20")
    return f"abfss://{workspace_name}@onelake.dfs.fabric.microsoft.com/{lakehouse_name}.Lakehouse"
</code></pre>
<pre><code class="language-python">construct_base_abfss_path(workspace_name="polars_demo_workspace", lakehouse_name="polars_demo_lakehouse")
</code></pre>
<pre><code class="language-plaintext">'abfss://polars_demo_workspace@onelake.dfs.fabric.microsoft.com/polars_demo_lakehouse.Lakehouse'
</code></pre>
<p>Finally, we need to pass the "storage options" information to Polars to enable it to read from or write to a Fabric lakehouse.</p>
<p>The <code>storage_options</code> parameter is a dictionary which needs to contain two named elements:</p>
<ul>
<li><code>bearer_token</code> - enables Polars to authenticate with the Fabric lakehouse API</li>
<li><code>use_fabric_endpoint</code> - set to <code>"true"</code> to tell Polars to use the Fabric endpoint</li>
</ul>
<p>The <code>notebookutils</code> Python package is installed in the Python environment used by Fabric notebooks.  This enables you to retrieve the bearer token.</p>
<pre><code class="language-python">import notebookutils
storage_options = {
    "bearer_token": notebookutils.credentials.getToken('storage'),
    "use_fabric_endpoint": "true"
}

pl.scan_csv(
    f"abfss://polars_demo_workspace@onelake.dfs.fabric.microsoft.com/polars_demo_lakehouse.Lakehouse/Files/land_registry_data/*.csv",
    storage_options=storage_options
)
</code></pre>
<p>Putting this all together, we can now set up the paths we will read from and write to using Polars dynamically based on whether we are running the notebook locally or on Fabric:</p>
<ul>
<li>Detect if we are running the notebook in Fabric (specifically testing to see if it is a Python notebook)</li>
<li>Build the base path:
<ul>
<li>An <code>abfss</code> path if we are running on Fabric</li>
<li>A standard file path if we are running locally</li>
</ul>
</li>
<li>Additionally, if running on Fabric, import and leverage the <code>notebookutils</code> package to authenticate and generate a token that will enable connection to the lakehouses (provided we have permissions to do so)</li>
<li>Construct paths as required for source(s) and target(s) - in this case, we are keeping things simple:
<ul>
<li>We are reading from and writing to the same workspace / lakehouse</li>
<li>We are reading from one source (a folder containing *.csv files)</li>
<li>We are writing to three target tables: prices, dates and locations</li>
</ul>
</li>
</ul>
<pre><code class="language-python">class FabricPaths:
    
    def __init__(self, workspace_name: str, lakehouse_name: str, local_base_path: str = "data/fabric"):
        self.workspace_name = workspace_name
        self.lakehouse_name = lakehouse_name
        self.local_base_path = local_base_path
        self.is_fabric = FabricPaths._is_fabric_python_notebook()
        
        if self.is_fabric:
            import notebookutils  # This Python package is only available on Fabric, so we need to import it conditionally.
    
    def generate_file_path(self, relative_path: str) -&gt; str:
        """Generate a full file path for the given folder type and name."""
        base_path = self._construct_base_abfss_path()
        return f"{self._get_base_path()}/Files/{relative_path}"
    
    def generate_table_path(self, schema_name: str, table_name: str) -&gt; str:
        """Generate a full table path for the given schema and table name."""
        base_path = self._construct_base_abfss_path()
        return f"{self._get_base_path()}/Tables/{schema_name}/{table_name}"
    
    def get_storage_options(self):
        """Get storage options for accessing Fabric storage."""
        if self.is_fabric:
            storage_options = {
                "bearer_token": notebookutils.credentials.getToken('storage'),
                "use_fabric_endpoint": "true"
            }
        else:
            storage_options = {}
        return storage_options
    
    def _get_base_path(self) -&gt; str:
        """Get the appropriate base path depending on the environment."""
        if self.is_fabric:
            return self._construct_base_abfss_path()
        else:
            return self.local_base_path
    
    @staticmethod
    def _is_fabric_python_notebook() -&gt; bool:
        """Detect specifically a Python (non-Spark) notebook."""
        return (
            'JUPYTER_SERVER_HOME' in os.environ 
            and 'SPARK_HOME' not in os.environ
        )
    
    def _construct_base_abfss_path(self) -&gt; str:
        """Construct the base ABFSS path for a given workspace and lakehouse."""
        # Because it is a URL, replace spaces with %20
        workspace_name = self.workspace_name.replace(" ", "%20")
        lakehouse_name = self.lakehouse_name.replace(" ", "%20")
        return f"abfss://{workspace_name}@onelake.dfs.fabric.microsoft.com/{lakehouse_name}.Lakehouse"
</code></pre>
<pre><code class="language-python"># Now use this class to generate paths

# We only need one instance because we are working within a single Fabric workspace and lakehouse
fabric_paths = FabricPaths(
    workspace_name="polars_demo_workspace",
    lakehouse_name="polars_demo_lakehouse",
    local_base_path="../data/fabric"
    )

# Generate paths
raw_data_download_path = fabric_paths.generate_file_path("land_registry_data")
logger.info(f"Path to download CSV files into: {raw_data_download_path}")

source_path = fabric_paths.generate_file_path("land_registry_data/*.csv")
logger.info(f"Glob path to read all CSV files: {source_path}")

target_path_prices = fabric_paths.generate_table_path("house_price_analytics", "prices")
logger.info(f"Target table path for price fact table: {target_path_prices}")

target_path_dates = fabric_paths.generate_table_path("house_price_analytics", "dates")
logger.info(f"Target table path for date dimension table: {target_path_dates}")

target_path_locations = fabric_paths.generate_table_path("house_price_analytics", "locations")
logger.info(f"Target table path for location dimension table: {target_path_locations}")

storage_options = fabric_paths.get_storage_options()
logger.info(f"Storage options for accessing Fabric storage: {storage_options}")
</code></pre>
<p>When running locally, this generates the following log:</p>
<pre><code class="language-plaintext">INFO: Path to download CSV files into: ../data/fabric/Files/land_registry_data
INFO: Glob path to read all CSV files: ../data/fabric/Files/land_registry_data/*.csv
INFO: Target table path for price fact table: ../data/fabric/Tables/house_price_analytics/prices
INFO: Target table path for date dimension table: ../data/fabric/Tables/house_price_analytics/dates
INFO: Target table path for location dimension table: ../data/fabric/Tables/house_price_analytics/locations
INFO: Storage options for accessing Fabric storage: {}
</code></pre>
<p>When running in a Fabric Python Notebook, it generates the following log:</p>
<pre><code class="language-plaintext">Path to download CSV files into: abfss://polars_demo_workspace@onelake.dfs.fabric.microsoft.com/polars_demo_lakehouse.Lakehouse/Files/land_registry_data
Glob path to read all CSV files: abfss://polars_demo_workspace@onelake.dfs.fabric.microsoft.com/polars_demo_lakehouse.Lakehouse/Files/land_registry_data/*.csv
Target table path for price fact table: abfss://polars_demo_workspace@onelake.dfs.fabric.microsoft.com/polars_demo_lakehouse.Lakehouse/Tables/house_price_analytics/prices
Target table path for date dimension table: abfss://polars_demo_workspace@onelake.dfs.fabric.microsoft.com/polars_demo_lakehouse.Lakehouse/Tables/house_price_analytics/dates
Target table path for location dimension table: abfss://polars_demo_workspace@onelake.dfs.fabric.microsoft.com/polars_demo_lakehouse.Lakehouse/Tables/house_price_analytics/locations
Storage options for accessing Fabric storage: {'bearer_token': '[REDACTED]', 'use_fabric_endpoint': 'true'}
</code></pre>
<p>That's it!  The rest of the code is identical for both environments: the helper class above takes care of the only things that need to change - how the path is formed, and how the <code>storage_options</code> are set up for connecting to OneLake.</p>
<p>This class can become more sophisticated, for example:</p>
<ul>
<li>Adding a third option: run code locally, but connect to Fabric lakehouse for reading and writing data</li>
<li>Handling for default pinned lakehouses</li>
<li>Checking the workspace and lakehouse specified exist by calling Fabric APIs</li>
<li>Wrapping the logic above into a package and deploying it on Fabric so it is available across all notebooks</li>
</ul>
<p>But we have kept it simple in this case to illustrate the key concepts.</p>
<h2 id="download-data">Download data</h2>
<p>To support this use case, we are going to download some open data to prime the "Files" area with raw data we can analyse.</p>
<p>We are sourcing this from the <a href="https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads">UK Land Registry House Price Data open data repository</a>.</p>
<p>The data is available for use under an <a href="https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/">Open Government Licence</a>.</p>
<pre><code class="language-python">import requests
import fsspec

HOUSE_PRICE_BASE_URL = "http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/"

# Each file is approximately 100MB in size.  Change the number of years to control the total data size.
NUMBER_OF_YEARS = 3

list_of_files = [f"pp-{year}.csv" for year in range(2025, 2025 - NUMBER_OF_YEARS, -1)]

for file_name in list_of_files:
  
    remote_file_url = f"{HOUSE_PRICE_BASE_URL}{file_name}"
    path_to_save_file = raw_data_download_path + "/" + file_name

    # Download the CSV file with streaming enabled to avoid OOM on limited memory
    with requests.get(remote_file_url, stream=True) as response:
        response.raise_for_status()  # Ensure we notice bad responses

        # fsspec automatically handles the protocol (file:// versus abfss://) based on the path prefix
        with fsspec.open(path_to_save_file, mode='wb', **storage_options) as f:
            # Write in 1MB chunks
            for chunk in response.iter_content(chunk_size=1024*1024):
                f.write(chunk)

    logger.info(f"Downloaded {file_name} to: {path_to_save_file}")
</code></pre>
<pre><code class="language-plaintext">INFO: Downloaded pp-2025.csv to: ../data/fabric/Files/land_registry_data/pp-2025.csv
INFO: Downloaded pp-2024.csv to: ../data/fabric/Files/land_registry_data/pp-2024.csv
INFO: Downloaded pp-2023.csv to: ../data/fabric/Files/land_registry_data/pp-2023.csv
</code></pre>
<h2 id="reading-files">Reading files</h2>
<p>When you create a new Python Notebook in Fabric you get immediate access to:</p>
<ul>
<li>Polars (currently v1.6 in the default environment)</li>
<li>The <code>delta-rs</code> library for Delta Lake operations</li>
</ul>
<p>You can use all of the common Polars functions to read files from, and write files to, a Fabric lakehouse, in both eager and lazy versions:</p>
<table>
<thead>
<tr>
<th style="text-align: left;">Format</th>
<th style="text-align: left;">Eager Read</th>
<th style="text-align: left;">Lazy Read</th>
<th style="text-align: left;">Eager Write</th>
<th style="text-align: left;">Lazy Write</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left;">CSV</td>
<td style="text-align: left;"><code>pl.read_csv()</code></td>
<td style="text-align: left;"><code>pl.scan_csv()</code></td>
<td style="text-align: left;"><code>df.write_csv()</code></td>
<td style="text-align: left;"><code>lf.sink_csv()</code></td>
</tr>
<tr>
<td style="text-align: left;">Excel</td>
<td style="text-align: left;"><code>pl.read_excel()</code></td>
<td style="text-align: left;">❌</td>
<td style="text-align: left;"><code>df.write_excel()</code></td>
<td style="text-align: left;">❌</td>
</tr>
<tr>
<td style="text-align: left;">Parquet</td>
<td style="text-align: left;"><code>pl.read_parquet()</code></td>
<td style="text-align: left;"><code>pl.scan_parquet()</code></td>
<td style="text-align: left;"><code>df.write_parquet()</code></td>
<td style="text-align: left;"><code>lf.sink_parquet()</code></td>
</tr>
<tr>
<td style="text-align: left;">JSON</td>
<td style="text-align: left;"><code>pl.read_json()</code></td>
<td style="text-align: left;">❌</td>
<td style="text-align: left;"><code>df.write_json()</code></td>
<td style="text-align: left;">❌</td>
</tr>
<tr>
<td style="text-align: left;">NDJSON</td>
<td style="text-align: left;"><code>pl.read_ndjson()</code></td>
<td style="text-align: left;"><code>pl.scan_ndjson()</code></td>
<td style="text-align: left;"><code>df.write_ndjson()</code></td>
<td style="text-align: left;"><code>lf.sink_ndjson()</code></td>
</tr>
<tr>
<td style="text-align: left;">Delta</td>
<td style="text-align: left;"><code>pl.read_delta()</code></td>
<td style="text-align: left;"><code>pl.scan_delta()</code></td>
<td style="text-align: left;"><code>df.write_delta()</code></td>
<td style="text-align: left;">💡 coming soon?</td>
</tr>
</tbody>
</table>
<p>The reason you don’t see a <code>sink_delta()</code> method in Polars for Python is that it’s very new and not yet part of the stable public API.  It was introduced in late 2025 in Polars’ Rust core to allow streaming writes directly to Delta Lake without collecting all data in memory first.</p>
<p>As of the last stable release (early 2026), the Polars Python package does not expose <code>LazyFrame.sink_delta()</code> or <code>DataFrame.sink_delta()</code> in the public API.  The Polars team has indicated that <code>sink_delta</code> will likely appear in future stable releases once the Python bindings are finalised and tested.  Once available, this will enable Polars to write out larger-than-memory results without holding them all in RAM.</p>
<p>In this demo, we are going to use the lazy API to read the CSV files we downloaded above.  Once we've built up our transformations over the CSV sourced LazyFrame, we'll need to do a <code>.collect()</code> before using <code>write_delta()</code>.</p>
<pre><code class="language-python">import polars as pl

logging.info(f"Reading price paid data from location {source_path}...")

# Files area
price_paid_data = pl.scan_csv(
    source_path,  # ABFSS path to the CSV files in the Files area.
    has_header=False,
    null_values=[""],
    storage_options=storage_options,  # Provides Polars with the necessary credentials to read from Fabric.
    infer_schema=False,
    schema={
        "transaction_unique_identifier": pl.Utf8,
        "price": pl.Float64,
        "date_of_transfer": pl.Datetime,
        "postcode": pl.Utf8,
        "property_type": pl.Utf8,
        "old_new": pl.Utf8,
        "duration": pl.Utf8,
        "paon": pl.Utf8,
        "saon": pl.Utf8,
        "street": pl.Utf8,
        "locality": pl.Utf8,
        "town_city": pl.Utf8,
        "district": pl.Utf8,
        "county": pl.Utf8,
        "ppd_category_type": pl.Utf8,
        "record_status": pl.Utf8
    })
</code></pre>
<pre><code class="language-python">price_paid_data.head(5).collect_schema()
</code></pre>
<pre><code class="language-plaintext">Schema([('transaction_unique_identifier', String),
        ('price', Float64),
        ('date_of_transfer', Datetime(time_unit='us', time_zone=None)),
        ('postcode', String),
        ('property_type', String),
        ('old_new', String),
        ('duration', String),
        ('paon', String),
        ('saon', String),
        ('street', String),
        ('locality', String),
        ('town_city', String),
        ('district', String),
        ('county', String),
        ('ppd_category_type', String),
        ('record_status', String)])
</code></pre>
<h2 id="data-transformation">Data Transformation</h2>
<p>Now we have a LazyFrame in place, we can start to build up the transformations we want to apply using Polars' composable expression API:</p>
<pre><code class="language-python"># Convert the property_type column from single letter codes to full descriptions
price_paid_data = (
    price_paid_data
    .with_columns(
        pl.when(pl.col("property_type") == "D")
        .then(pl.lit("Detached"))
        .when(pl.col("property_type") == "S")
        .then(pl.lit("Semi-Detached"))
        .when(pl.col("property_type") == "T")
        .then(pl.lit("Terraced"))
        .when(pl.col("property_type") == "F")
        .then(pl.lit("Flat/Maisonette"))
        .when(pl.col("property_type") == "O")
        .then(pl.lit("Other"))
        .otherwise(pl.col("property_type"))
        .alias("property_type")
    )
)
</code></pre>
<pre><code class="language-python"># Do the same of old_new
price_paid_data = (
    price_paid_data
    .with_columns(
        pl.when(pl.col("old_new") == "Y")
        .then(pl.lit("New"))
        .when(pl.col("old_new") == "N")
        .then(pl.lit("Old"))
        .otherwise(pl.col("old_new"))
        .alias("old_new")
    )
)
</code></pre>
<pre><code class="language-python"># Use regex to extract the postcode area (the first one or two letters)
price_paid_data = (
    price_paid_data
    .with_columns(
        pl.col("postcode")
        .str.extract(r"^([A-Z]{1,2})", 1)
        .alias("postcode_area")
    )
)
</code></pre>
<pre><code class="language-python"># Convert date_of_transfer from datetime to date
price_paid_data = (
    price_paid_data
    .with_columns(
        pl.col("date_of_transfer")
        .dt.date()
        .alias("date_of_transfer")
    )
)
</code></pre>
<h3 id="create-fact-table">Create fact table</h3>
<p>Select the columns we want to use in the core fact table.</p>
<pre><code class="language-python"># Select relevant columns for downstream analysis
prices = price_paid_data.select([
    "price",
    "date_of_transfer",
    "postcode_area",
    "town_city",
    "property_type",
    "old_new",
])
</code></pre>
<h3 id="create-date-dimension">Create date dimension</h3>
<p>Use the min and max dates to build the date dimension table.</p>
<p>At this stage we need to materialise the data.  But given we are operating over a single column, the operation will be optimised through <strong>projection pushdown</strong>.</p>
<pre><code class="language-python">min_date = price_paid_data.select(pl.col("date_of_transfer").min()).collect()[0,0]
max_date = price_paid_data.select(pl.col("date_of_transfer").max()).collect()[0,0]
min_date, max_date
</code></pre>
<pre><code class="language-plaintext">(datetime.date(2023, 1, 1), datetime.date(2025, 11, 28))
</code></pre>
<pre><code class="language-python">dates = (
    pl.date_range(
        start=min_date,
        end=max_date,
        interval="1d",
        eager=True,
    )
    .to_frame(name="date")
    .with_columns([
        pl.col("date").dt.year().alias("year"),
        pl.col("date").dt.month().alias("month"),
        pl.col("date").dt.strftime("%B").alias("month_name"),
        pl.col("date").dt.day().alias("day"),
        pl.col("date").dt.weekday().alias("weekday"),
        pl.col("date").dt.strftime("%A").alias("weekday_name"),
        pl.col("date").dt.ordinal_day().alias("day_of_year"),
    ])
)   
</code></pre>
<h3 id="create-location-dimension">Create location dimension</h3>
<p>The assumption is that there is a location hierarchy, listed here from the coarsest to the most granular level:</p>
<ul>
<li>County</li>
<li>District</li>
<li>Town or City</li>
</ul>
<pre><code class="language-python">locations = (
    price_paid_data
    .select(
        [
            "county",
            "district",
            "town_city",
        ]
    )
    .unique()
)
</code></pre>
<h2 id="writing-to-delta-tables">Writing to Delta Tables</h2>
<p>It is common practice to write out a Polars DataFrame to a Delta table in the Tables area of your Lakehouse.</p>
<p>There are various write modes which are available:</p>
<p>Overwrite entire table:</p>
<pre><code class="language-python">df.write_delta(path, mode="overwrite")
</code></pre>
<p>Append to existing table:</p>
<pre><code class="language-python">df.write_delta(path, mode="append")
</code></pre>
<p>Merge (upsert) - returns a TableMerger for chaining:</p>
<pre><code class="language-python">(
    df.write_delta(
        path,
        mode="merge",
        delta_merge_options={
            "predicate": "source.id = target.id",
            "source_alias": "source",
            "target_alias": "target"
        }
    )
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute()
)
</code></pre>
<h3 id="handling-timestamps">Handling Timestamps</h3>
<p>A common gotcha when writing Delta tables from Polars is timezone handling. Fabric's SQL endpoint expects timestamps with timezone information.</p>
<p>We can address this by adding timezone information, for example:</p>
<pre><code class="language-python">df = (
    df
    .with_columns(
        [
            pl.col("datetime_of_order")
            .dt.replace_time_zone("UTC")
            .alias("datetime_of_order")
        ]
    )
)
</code></pre>
<h3 id="write-tables">Write tables</h3>
<pre><code class="language-python">logger.info(f"Writing prices data to Delta table: {target_path_prices}")
prices.collect().write_delta(target_path_prices, mode="overwrite", storage_options=storage_options)
</code></pre>
<pre><code class="language-plaintext">INFO: Writing prices data to Delta table: ../data/fabric/Tables/house_price_analytics/prices
INFO:notebook_logger:Writing prices data to Delta table: ../data/fabric/Tables/house_price_analytics/prices
</code></pre>
<pre><code class="language-python">logger.info(f"Writing locations data to Delta table: {target_path_locations}")
locations.collect().write_delta(target_path_locations, mode="overwrite", storage_options=storage_options)
</code></pre>
<pre><code class="language-plaintext">INFO: Writing locations data to Delta table: ../data/fabric/Tables/house_price_analytics/locations
INFO:notebook_logger:Writing locations data to Delta table: ../data/fabric/Tables/house_price_analytics/locations
</code></pre>
<pre><code class="language-python">logger.info(f"Writing dates data to Delta table: {target_path_dates}")
dates.write_delta(target_path_dates, mode="overwrite", storage_options=storage_options)
</code></pre>
<pre><code class="language-plaintext">INFO: Writing dates data to Delta table: ../data/fabric/Tables/house_price_analytics/dates
INFO:notebook_logger:Writing dates data to Delta table: ../data/fabric/Tables/house_price_analytics/dates
</code></pre>
<h2 id="reading-from-deltalake">Reading from DeltaLake</h2>
<p>When reading Delta tables, we can use lazy execution to maximise scale and performance.</p>
<p>Let's illustrate this by generating some analytics in this notebook, using the data we have just written to the lakehouse in Delta format.</p>
<pre><code class="language-python"># Load prices from Delta and filter them to exclude "Other" property types
logger.info(f"Reading prices data back from Delta table: {target_path_prices}")
prices = (
    pl.scan_delta(
        target_path_prices,
        storage_options=storage_options,
    )
    .filter(pl.col("property_type") != "Other")
)
</code></pre>
<pre><code class="language-plaintext">INFO: Reading prices data back from Delta table: ../data/fabric/Tables/house_price_analytics/prices
INFO:notebook_logger:Reading prices data back from Delta table: ../data/fabric/Tables/house_price_analytics/prices
</code></pre>
<pre><code class="language-python"># Load the date dimension, a new month_tag column in the form YYYY_MM
logger.info(f"Reading dates data back from Delta table: {target_path_dates}")
dates = (
    pl.scan_delta(
        target_path_dates,
        storage_options=storage_options,
    )
    .with_columns(
        [
            pl.col("date").dt.strftime("%Y_%m").alias("month_tag")
        ]
    )
)
</code></pre>
<pre><code class="language-plaintext">INFO: Reading dates data back from Delta table: ../data/fabric/Tables/house_price_analytics/dates
INFO:notebook_logger:Reading dates data back from Delta table: ../data/fabric/Tables/house_price_analytics/dates
</code></pre>
<pre><code class="language-python"># Now join the two tables to get month_tag into the prices table
prices = (
    prices
    .join(
        dates.select(
            [
                "date",
                "month_tag"
            ]
        ),
        left_on="date_of_transfer",
        right_on="date",
        how="left"
    )
)
</code></pre>
<pre><code class="language-python"># Finally summarise the data up to monthly level by property type
monthly_summary = (
    prices
    .group_by(
        [
            "month_tag",
            "property_type"
        ]
    )
    .agg(
        [
            pl.len().alias("number_of_transactions"),
            pl.col("price").median().alias("median_price"),
            pl.col("price").min().alias("min_price"),
            pl.col("price").max().alias("max_price"),
        ]
    )
    .sort(
        [
            "month_tag",
            "property_type"
        ]
    )
)
</code></pre>
<pre><code class="language-python">monthly_summary = monthly_summary.collect()
</code></pre>
<pre><code class="language-python"># Plot the monthly summary using Plotly
import plotly.express as px

fig = px.line(
    monthly_summary,
    x="month_tag",
    y="median_price",
    color="property_type",
    title="Median House Prices by Property Type",
    labels={
        "month_tag": "Month",
        "median_price": "Median Price (£)",
        "property_type": "Property Type"
    }
)
fig.update_yaxes(range=[0, 600000])
fig.show()
</code></pre>
<p><a target="_blank" href="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/01/chart_5.png"><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/01/chart_5.png" alt="Time series chart showing changes in house prices over time" title="Time series chart showing changes in house prices over time" srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2026/01/chart_5.png 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2026/01/chart_5.png 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2026/01/chart_5.png 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2026/01/chart_5.png 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></a></p>
<h2 id="performance-optimisation-tips">Performance Optimisation Tips</h2>
<ol>
<li>Use lazy evaluation for large datasets - when data approaches memory limits, lazy evaluation lets Polars optimise the query plan (see the sketch after this list).</li>
<li>Optimise row groups for DirectLake - if your Delta tables will be consumed by Power BI's DirectLake mode, configure larger rowgroups.  See this blog <a href="https://fabric.guru/delta-lake-tables-for-optimal-direct-lake-performance-in-fabric-python-notebook">"Delta Lake Tables For Optimal Direct Lake Performance In Fabric Python Notebook"</a> from Sandeep Pawar (Principal Program Manager, Microsoft Fabric CAT) for more details.</li>
<li>Scale up your notebook environment when needed - using the <a href="https://learn.microsoft.com/en-us/fabric/data-engineering/using-python-experience-on-notebook#session-configuration-magic-command"><code>%%configure</code></a> magic command in a cell at the top of the notebook.  Available configurations: 4, 8, 16, 32, or 64 vCores (memory scales proportionally).</li>
</ol>
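<p>For the first tip, here is a sketch of what that looks like in practice, reusing the <code>target_path_prices</code> and <code>storage_options</code> variables from earlier in this post (the £1m threshold is purely illustrative). The column selection and filter are pushed down into the scan, so only the data needed for the result is read:</p>
<pre><code class="language-python">import polars as pl

top_towns = (
    pl.scan_delta(target_path_prices, storage_options=storage_options)
    .select(["town_city", "price"])        # projection pushdown: only read these columns
    .filter(pl.col("price") &gt; 1_000_000)   # predicate pushdown: filter while scanning
    .group_by("town_city")
    .agg(pl.len().alias("transactions_over_1m"))
    .sort("transactions_over_1m", descending=True)
    .head(10)
    .collect()                             # nothing is executed until this point
)
</code></pre>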
<h2 id="current-limitations">Current Limitations</h2>
<p>A few things to be aware of:</p>
<ul>
<li><strong>V-ORDER</strong> - Fabric's V-ORDER optimisation requires Spark; Polars-written Delta tables won't have this applied. Tuning rowgroups can partially compensate.</li>
<li><strong>Liquid Clustering</strong> - similarly, Liquid Clustering is Spark-only.</li>
<li><strong>Polars version</strong> - the pre-installed version may lag behind the latest release. You can upgrade with <code>%pip install polars --upgrade</code>, though this adds notebook startup time.</li>
<li><strong>Memory ceiling</strong> - the maximum single-node configuration is 64 vCores. Beyond that, you'll need Spark or Polars Cloud (when available).</li>
</ul>
<h2 id="further-reading">Further reading</h2>
<ul>
<li><a href="https://learn.microsoft.com/en-us/fabric/data-engineering/using-python-experience-on-notebook">Microsoft Learn: Python experience on Notebook</a></li>
<li><a href="https://learn.microsoft.com/en-us/fabric/data-engineering/fabric-notebook-selection-guide">Microsoft Learn: Choosing Between Python and PySpark Notebooks</a></li>
<li><a href="https://delta-io.github.io/delta-rs/integrations/delta-lake-polars/">DeltaLake Documentation: Using Delta Lake with polars</a></li>
<li><a href="https://fabric.guru/working-with-delta-tables-in-fabric-python-notebook-using-polars">Sandeep Pawar: Working With Delta Tables in Fabric Python Notebook Using Polars</a></li>
</ul>
<h2 id="summary">Summary</h2>
<p>Polars on Microsoft Fabric offers a compelling alternative to Spark for many data engineering workloads. The combination of Polars' performance, Fabric's native OneLake integration, and the cost efficiency of single-node compute creates a practical path for teams who want enterprise-grade data pipelines without the complexity of distributed systems.</p>
<p>Start small, measure your workloads, and scale to Spark only when you genuinely need distributed compute. For many teams, that day may never come.</p>
<p>This is Part 4 of our Adventures in Polars series:</p>
<ul>
<li><strong>Part 1: <a href="https://endjin.com/blog/polars-faster-pipelines-simpler-infrastructure-happier-engineers">Why Polars Matters</a></strong> — The Decision Makers Guide for Polars.</li>
<li><strong>Part 2: <a href="https://endjin.com/blog/under-the-hood-what-makes-polars-so-scalable-and-fast">What Makes Polars So Scalable and Fast?</a></strong> — The technical deep-dive: lazy evaluation, query optimisation, parallelism, and the Rust foundation.</li>
<li><strong>Part 3: <a href="https://endjin.com/blog/practical-polars-code-examples-everyday-data-tasks">Code Examples for Everyday Data Tasks</a></strong> — Hands-on examples showing Polars in action.</li>
</ul>
<hr>
<p><em>Are you running Polars workloads on Microsoft Fabric? Have you found effective patterns for switching between local development and cloud deployment? We'd love to hear about your experiences in the comments below!</em></p>
<aside class="series-toc" aria-label="Series table of contents">
    <div class="series-toc__header">
        <h3 class="series-toc__title">Adventures in Polars</h3>
        <span class="series-toc__count">4 parts</span>
    </div>
    <ol class="series-toc__list">
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/polars-faster-pipelines-simpler-infrastructure-happier-engineers" class="series-toc__link">
                    <span class="series-toc__part-number">1.</span>
                    <span class="series-toc__part-title">Why Polars Matters</span>
                </a>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/under-the-hood-what-makes-polars-so-scalable-and-fast" class="series-toc__link">
                    <span class="series-toc__part-number">2.</span>
                    <span class="series-toc__part-title">What Makes Polars So Scalable and Fast?</span>
                </a>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/practical-polars-code-examples-everyday-data-tasks" class="series-toc__link">
                    <span class="series-toc__part-number">3.</span>
                    <span class="series-toc__part-title">Code Examples</span>
                </a>
            </li>
            <li class="series-toc__item series-toc__item--current" aria-current="page">
                <span class="series-toc__part-number">4.</span>
                <span class="series-toc__part-title">Polars Workloads on Fabric</span>
                <span class="series-toc__current-label">(you are here)</span>
            </li>
    </ol>
</aside>]]></content:encoded>
    </item>
    <item>
      <title>Practical Polars: Code Examples for Everyday Data Tasks</title>
      <description>Unlock Python Polars with this hands-on guide featuring practical code examples for data loading, cleaning, transformation, aggregation, and advanced operations that you can apply to your own data analysis projects.</description>
      <link>https://endjin.com/blog/practical-polars-code-examples-everyday-data-tasks</link>
      <guid isPermaLink="true">https://endjin.com/blog/practical-polars-code-examples-everyday-data-tasks</guid>
      <pubDate>Thu, 29 Jan 2026 05:33:00 GMT</pubDate>
      <category>Polars</category>
      <category>DataFrame</category>
      <category>Python</category>
      <category>Data Analysis</category>
      <category>Code Examples</category>
      <category>Data Transformation</category>
      <category>Data Science</category>
      <category>Microsoft Fabric</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/01/polars-code-examples.png" />
      <dc:creator>Barry Smart</dc:creator>
<content:encoded><![CDATA[<p>TL;DR: This hands-on guide provides practical Polars code examples for common data tasks. We cover data loading from various sources (CSV, Parquet, JSON), data exploration techniques, powerful data transformations using expressions, aggregations and grouping, and joining datasets. By explicitly showing the code for these everyday tasks, this guide serves as a reference for applying Polars' performance advantages to real-world data analysis workflows. Both eager execution (for interactive work) and lazy execution (for optimised performance) approaches are demonstrated, helping you leverage Polars' full potential regardless of your use case.</p>
<p>In our previous articles, we've <a href="https://endjin.com/blog/polars-faster-pipelines-simpler-infrastructure-happier-engineers">introduced Polars</a> and <a href="https://endjin.com/blog/under-the-hood-what-makes-polars-so-scalable-and-fast">explored its performance architecture</a>. Now it's time to get hands-on with practical examples of Polars in action.</p>
<p>This guide focuses on concrete code examples for common data tasks. Whether you're new to Polars or looking to expand your skills, these examples will help you apply Polars to your everyday data analysis workflows.</p>
<p>We use a set of data downloaded from the <a href="https://data.worldbank.org/">World Bank Open Data</a> specifically via the <a href="https://data360.worldbank.org/en/api">Data 360 API</a>. This data is made available by the World Bank under a <a href="https://creativecommons.org/licenses/by/4.0/">Creative Commons 4.0</a> license. It's a brilliant example of the power of open data and we'd like to thank the World Bank for making this data available.</p>
<p>Specifically, we will be working with the <strong>World Bank Indicators</strong> dataset, which contains macroeconomic series (GDP, population, education, etc.) for every country and year. We have prepared this data in a range of different formats to demonstrate how Polars can successfully interact with different source data formats.</p>
<p>The source code in this blog is available to download and run (along with other Polars examples) from the following public GitHub repo: <a href="https://github.com/endjin/endjin-polars-examples">https://github.com/endjin/endjin-polars-examples</a></p>
<h2 id="before-we-start-what-is-a-dataframe">Before we start: what is a DataFrame?</h2>
<p>Polars is a <strong>DataFrame</strong> library available in Python. But what do we mean by a DataFrame?</p>
<p>A dataframe is a two-dimensional, in-memory, tabular data structure that organises information into rows and columns, much like a spreadsheet or a database table. Each column represents a variable or attribute, while each row represents a single observation or record.</p>
<p>In Polars specifically, a DataFrame is a 2-dimensional heterogeneous data structure that is composed of multiple <strong>Series</strong>. A Series is a 1-dimensional homogeneous data structure, meaning it holds data of a single data type (e.g., all integers, or all strings). These Series represent the columns in the dataframe.</p>
<p>What makes dataframes particularly powerful is that while each individual column (Series) is homogeneous and strictly typed, the DataFrame as a whole allows for different data types across its columns (e.g. strings, integers, floats, dates, boolean, complex types) while still allowing operations across the entire structure. This flexibility makes them an intuitive and practical abstraction for working with the kind of structured data that dominates analytical workloads.</p>
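<p>To make the distinction concrete, here's a minimal sketch (the column names and values are invented purely for illustration):</p>
<pre><code class="language-python">import polars as pl

# A Series holds values of a single data type.
country = pl.Series("country", ["United Kingdom", "Ireland", "Canada"])
population = pl.Series("population_in_millions", [69.2, 5.4, 38.9])

# A DataFrame is a collection of named Series - its columns - which can
# have different data types from one another.
df = pl.DataFrame([country, population])
print(df.schema)  # country -> String, population_in_millions -> Float64
</code></pre>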
<p>For Python developers, the dataframe concept was popularised by <a href="https://pandas.pydata.org/">pandas</a>, which became the de facto standard for data manipulation over the past decade. However, as datasets have grown larger and performance expectations have increased, the limitations of pandas have become more apparent.</p>
<p>This has created space for newer libraries like <a href="https://pola.rs/">Polars</a> to reimagine the dataframe from the ground up, retaining the familiar mental model while delivering significantly improved performance through modern design choices such as lazy evaluation, parallel execution, and memory-efficient columnar storage that take advantage of modern compute hardware.</p>
<p>If you've worked with pandas, SQL result sets, or even Excel tables, you already understand the core concept. Polars simply executes on it faster and more efficiently.</p>
<h2 id="developer-experience-come-for-the-speed-stay-for-the-api">Developer Experience: "Come for the Speed, Stay for the API"</h2>
<p>The Polars community has a saying: "Come for the speed, stay for the API." This captures an important truth about Polars' adoption - while performance often drives initial interest, the well-designed developer experience keeps users engaged.</p>
<p>Polars achieves this through:</p>
<ul>
<li><strong>Consistency</strong> - methods use snake_case naming with predictable patterns</li>
<li><strong>Fluent interface</strong> - method chaining creates readable data transformation pipelines</li>
<li><strong>Error messages</strong> - clear, actionable feedback when something goes wrong</li>
<li><strong>Schema enforcement</strong> - strong typing that prevents common Pandas "surprises" that often thwart developers</li>
<li><strong>Expression system</strong> - a composable language for data manipulation</li>
</ul>
<p>As Ritchie Vink, the creator of Polars, puts it: "just write what you want and we will apply those optimizations for you... write readable idiomatic queries which just explain your intent and we will figure out how to make it fast."<a id="fnref:1" href="https://endjin.com/blog/practical-polars-code-examples-everyday-data-tasks#fn:1" class="footnote-ref"><sup>1</sup></a> This philosophy places user experience on equal footing with performance.</p>
<p>This is the benefit of a well-architected system, and precisely why we write the <a href="https://endjin.com/blog/how-dotnet-10-boosted-ais-dotnet-performance-by-7-percent-for-free">yearly .NET Performance Boost posts</a>: we change zero lines of code and still get significant performance gains because the core team keeps optimising the internals.</p>
<p>For those who are familiar with the <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/index.html">PySpark SQL and DataFrame API</a>, adoption of Polars will be quite straightforward.</p>
<p>For those coming from a pandas background: Polars shares some method names, but the structure of the API is different and, more fundamentally, Polars applies a stricter set of principles. We may cover a "Polars versus Pandas" blog in the future if there is demand for more detail.</p>
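<p>As a small illustration of that difference in style (the dataframe here is invented purely for this example):</p>
<pre><code class="language-python">import polars as pl

df = pl.DataFrame({"country": ["GBR", "FRA", "JPN"], "year": [1999, 2005, 2012]})

# pandas users typically reach for boolean indexing, e.g. df[df["year"] &gt; 2000].
# In Polars the same intent is expressed through the expression API:
recent = df.filter(pl.col("year") &gt; 2000)
</code></pre>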
<h2 id="setup-and-installation">Setup and Installation</h2>
<p>Before we begin, let's make sure Polars is installed and imported correctly.</p>
<p>We are using <a href="https://docs.astral.sh/uv/">uv</a> to manage our Python environment in this demo, so we have run the following command:</p>
<pre><code class="language-plaintext">uv add polars
</code></pre>
<p>If you are using <a href="https://pip.pypa.io/en/stable/index.html">pip</a>, the equivalent command would be <code>pip install polars</code>.</p>
<p>If you are using <a href="https://python-poetry.org/">poetry</a>, the equivalent command would be <code>poetry add polars</code>.</p>
<p>Next stage is to import Polars (the convention is to use an alias of <code>pl</code>) and check the version we are using:</p>
<pre><code class="language-python"># Import Polars
import polars as pl

# Check version
f"Polars version: {pl.__version__}"
</code></pre>
<pre><code class="language-plaintext">'Polars version: 1.35.2'
</code></pre>
<h2 id="creating-dataframes-from-scratch">Creating Dataframes from scratch</h2>
<p>There are a range of formats supported for creating Polars dataframes in code; the most common one we tend to adopt is a <code>list</code> of <code>dict</code> (or <code>dataclass</code>) objects, as demonstrated below. This example also shows how Polars elegantly handles null values.</p>
<p>The creation of a dataframe in code is useful for creating test cases in unit tests and for exploring the Polars API with small "toy" datasets.</p>
<pre><code class="language-python">from datetime import datetime

# From list of dictionary based records
df = pl.DataFrame(
    [
        {'column_a': 1, 'column_b': 'Red', 'column_c': None, 'column_d': 10.5, 'column_e': datetime(2020, 1, 1)},
        {'column_a': 2, 'column_b': 'Blue', 'column_c': False, 'column_d': None, 'column_e': datetime(2021, 2, 2)},
        {'column_a': None, 'column_b': 'Green', 'column_c': True, 'column_d': 30.1, 'column_e': None},
        {'column_a': 4, 'column_b': None, 'column_c': False, 'column_d': 40.7, 'column_e': datetime(2023, 4, 4)},
        {'column_a': 5, 'column_b': 'Purple', 'column_c': True, 'column_d': 50.2, 'column_e': datetime(2024, 5, 5)},
    ]
)
df
</code></pre>
<table>
<thead>
<tr>
<th>column_a</th>
<th>column_b</th>
<th>column_c</th>
<th>column_d</th>
<th>column_e</th>
</tr>
</thead>
<tbody>
<tr>
<td>i64</td>
<td>str</td>
<td>bool</td>
<td>f64</td>
<td>datetime[μs]</td>
</tr>
<tr>
<td>1</td>
<td>"Red"</td>
<td>null</td>
<td>10.5</td>
<td>2020-01-01 00:00:00</td>
</tr>
<tr>
<td>2</td>
<td>"Blue"</td>
<td>false</td>
<td>null</td>
<td>2021-02-02 00:00:00</td>
</tr>
<tr>
<td>null</td>
<td>"Green"</td>
<td>true</td>
<td>30.1</td>
<td>null</td>
</tr>
<tr>
<td>4</td>
<td>null</td>
<td>false</td>
<td>40.7</td>
<td>2023-04-04 00:00:00</td>
</tr>
<tr>
<td>5</td>
<td>"Purple"</td>
<td>true</td>
<td>50.2</td>
<td>2024-05-05 00:00:00</td>
</tr>
</tbody>
</table>
<p>In the output above, you can see Polars displays the dataframe in tabular form with column names and data types displayed.</p>
<p>Polars implements a strict, statically-known type system. Unlike pandas, where output data types can change depending on the data itself, Polars guarantees that schemas are known before query execution. This means that when you apply a join, filter, or transformation, you can predict exactly what the output type will be independently from the actual data flowing through.</p>
<p>This predictability is a significant benefit for developers: if you expect an integer column and write downstream code that depends on integer behaviour, you won't discover a surprise float conversion halfway through a pipeline (as can occur with null values in an integer column).</p>
<p>If there's a type mismatch, Polars throws an error before the query runs, not twenty steps into your processing chain. As Vink puts it: "this strictness will save you a lot of headaches".</p>
<p>In the top left area, the "shape" of the dataframe is displayed using the tuple (number of rows, number of columns).</p>
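<p>As a minimal sketch of the strictness described above (the column and values here are invented for illustration):</p>
<pre><code class="language-python">import polars as pl

df = pl.DataFrame({"id": ["1", "2", "not-a-number"]})

# Casting is strict by default: a value that cannot be converted raises an
# error immediately, rather than being silently coerced to a float or null.
try:
    df.with_columns(pl.col("id").cast(pl.Int64))
except pl.exceptions.InvalidOperationError as error:
    print(f"Cast rejected: {error}")
</code></pre>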
<h2 id="loading-data-from-other-sources">Loading Data from other sources</h2>
<p>Of course, in the majority of scenarios you will want to load data into a Polars dataframe from an external datasource.</p>
<p>Unlike similar tools such as DuckDB, Polars does not offer its own native storage format. Instead, it relies on well-established standards to both ingest and persist data.</p>
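<p>For example, persisting the toy dataframe we created above is simply a matter of writing it out in one of those standard formats (the file names below are illustrative):</p>
<pre><code class="language-python"># Write the dataframe out using open, portable formats.
df.write_parquet("toy_data.parquet")  # columnar, compressed, schema embedded in the file
df.write_csv("toy_data.csv")          # plain text, human readable
</code></pre>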
<p>We now walk through the key examples below.</p>
<h3 id="loading-csv-data">Loading CSV data</h3>
<p>One of the most common sources of data will be from Comma Separated Value (CSV) files. Here we load the <code>countries.csv</code> data (country metadata) from the World Bank.</p>
<p>Polars provides two approaches for loading CSV files. In this case, the file is small (less than 300 rows of data across 9 columns), so we load it using the "eager" <a href="https://docs.pola.rs/api/python/stable/reference/api/polars.read_csv.html"><code>polars.read_csv</code></a> method. Later on in this blog we'll demonstrate how the "lazy loading" <a href="https://docs.pola.rs/api/python/stable/reference/api/polars.scan_csv.html"><code>polars.scan_csv</code></a> method may be a better choice for larger data sets.</p>
<p>The eager approach immediately reads the entire file into memory and returns a DataFrame, which is convenient for exploratory work with smaller files such as in this use case.</p>
<p>You can see that it uses sensible defaults, reading the header row as column names and inferring data types from the data (we need to provide an additional hint to treat empty strings as <code>null</code> values so that <code>longitude</code> and <code>latitude</code> are treated as floating point numbers).</p>
<pre><code class="language-python">countries = pl.read_csv(CSV_COUNTRIES_PATH, infer_schema=True, null_values=[""])
</code></pre>
<pre><code class="language-python"># Display the dataframe.  By default, Polars shows the first and last 5 rows.
countries
</code></pre>
<table>
<thead>
<tr>
<th>country_code</th>
<th>iso2_code</th>
<th>country_name</th>
<th>region</th>
<th>region_id</th>
<th>income_level</th>
<th>capital_city</th>
<th>longitude</th>
<th>latitude</th>
</tr>
</thead>
<tbody>
<tr>
<td>str</td>
<td>str</td>
<td>str</td>
<td>str</td>
<td>str</td>
<td>str</td>
<td>str</td>
<td>f64</td>
<td>f64</td>
</tr>
<tr>
<td>ABW</td>
<td>AW</td>
<td>Aruba</td>
<td>Latin America &amp; Caribbean</td>
<td>LCN</td>
<td>High income</td>
<td>Oranjestad</td>
<td>-70.0167</td>
<td>12.5167</td>
</tr>
<tr>
<td>AFE</td>
<td>ZH</td>
<td>Africa Eastern and Southern</td>
<td>Aggregates</td>
<td>NA</td>
<td>Aggregates</td>
<td>null</td>
<td>null</td>
<td>null</td>
</tr>
<tr>
<td>AFG</td>
<td>AF</td>
<td>Afghanistan</td>
<td>Middle East, North Africa, Afg…</td>
<td>MEA</td>
<td>Low income</td>
<td>Kabul</td>
<td>69.1761</td>
<td>34.5228</td>
</tr>
<tr>
<td>AFR</td>
<td>A9</td>
<td>Africa</td>
<td>Aggregates</td>
<td>NA</td>
<td>Aggregates</td>
<td>null</td>
<td>null</td>
<td>null</td>
</tr>
<tr>
<td>AFW</td>
<td>ZI</td>
<td>Africa Western and Central</td>
<td>Aggregates</td>
<td>NA</td>
<td>Aggregates</td>
<td>null</td>
<td>null</td>
<td>null</td>
</tr>
<tr>
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
</tr>
<tr>
<td>XZN</td>
<td>A5</td>
<td>Sub-Saharan Africa excluding S…</td>
<td>Aggregates</td>
<td>NA</td>
<td>Aggregates</td>
<td>null</td>
<td>null</td>
<td>null</td>
</tr>
<tr>
<td>YEM</td>
<td>YE</td>
<td>Yemen, Rep.</td>
<td>Middle East, North Africa, Afg…</td>
<td>MEA</td>
<td>Low income</td>
<td>Sana'a</td>
<td>44.2075</td>
<td>15.352</td>
</tr>
<tr>
<td>ZAF</td>
<td>ZA</td>
<td>South Africa</td>
<td>Sub-Saharan Africa</td>
<td>SSF</td>
<td>Upper middle income</td>
<td>Pretoria</td>
<td>28.1871</td>
<td>-25.746</td>
</tr>
<tr>
<td>ZMB</td>
<td>ZM</td>
<td>Zambia</td>
<td>Sub-Saharan Africa</td>
<td>SSF</td>
<td>Lower middle income</td>
<td>Lusaka</td>
<td>28.2937</td>
<td>-15.3982</td>
</tr>
<tr>
<td>ZWE</td>
<td>ZW</td>
<td>Zimbabwe</td>
<td>Sub-Saharan Africa</td>
<td>SSF</td>
<td>Lower middle income</td>
<td>Harare</td>
<td>31.0672</td>
<td>-17.8312</td>
</tr>
</tbody>
</table>
<h3 id="loading-from-parquet-files">Loading from Parquet files</h3>
<p>Parquet has become the interchange format of choice for high-performance analytics, and Polars is designed to take full advantage of its characteristics. Unlike CSV, Parquet is a columnar, compressed, binary format that embeds schema metadata directly in the file - column names, data types, and statistics are all self-describing. This means no inference guesswork is required when loading data, and the strict typing aligns perfectly with Polars' philosophy of statically-known schemas.</p>
<p>We read parquet files using the <a href="https://docs.pola.rs/api/python/stable/reference/api/polars.read_parquet.html"><code>polars.read_parquet</code></a> method (or <a href="https://docs.pola.rs/api/python/stable/reference/api/polars.scan_parquet.html"><code>polars.scan_parquet</code></a> for lazy loading).</p>
<p>The columnar layout delivers significant benefits for analytical workloads. Compression is highly effective because similar data types are stored together, resulting in files that are often 10-100x smaller than equivalent CSVs. More importantly, Polars can read only the columns your query actually needs without touching the rest of the file: a technique known as projection pushdown that dramatically reduces I/O.</p>
<p>So our recommendation is that if you are going to be working with a large dataset stored in CSV format, it is often worthwhile to convert it to Parquet first: this reduces the footprint of the data on the filesystem and speeds up the inner dev loop of exploring the data.</p>
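<p>As a minimal sketch of such a conversion (the paths below are illustrative), the lazy scan-and-sink pattern streams the data so the whole CSV never has to fit in memory:</p>
<pre><code class="language-python"># One-off conversion: scan the CSV lazily and sink it straight to Parquet.
(
    pl.scan_csv("data/indicators.csv", null_values=[""])
      .sink_parquet("data/indicators.parquet")
)
</code></pre>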
<h4 id="intelligent-scan-optimisation">Intelligent Scan Optimisation</h4>
<p>Where Polars really shines is in its use of Parquet's embedded statistics. Each Parquet file contains metadata about its row groups, including minimum and maximum values for each column. When you apply a filter in a lazy query, Polars examines these statistics and can skip entire row groups that cannot possibly contain matching rows - without reading any of the underlying data. Combined with predicate pushdown (applying filters at scan level rather than after materialisation), this means Polars often reads only a fraction of the file.</p>
<p>As Ritchie Vink explains: "if you do a filter, we look at the parquet statistics in the file and we will not first read the whole file and then apply the filter, we will apply the filters while we're reading it in." The result is that a well-optimised Parquet based workflow can be orders of magnitude faster than the equivalent CSV processing, with substantially lower memory consumption.</p>
<h4 id="loading-multiple-files-in-one-operation">Loading multiple files in one operation</h4>
<p>In practice, data lakes rarely consist of a single Parquet file. Upstream processes typically write data in partitions - perhaps one file per day, per source system, or per logical partition. Rather than requiring you to enumerate each file individually, Polars supports glob patterns that match multiple files in a single operation.</p>
<p>We have simulated Hive-style partitioned data in the test data by creating a suite of parquet files using a partitioning strategy based on <code>year</code>. This results in one folder for each year and one (or potentially more) Parquet file within each folder.</p>
<p>Using globbing, we can load all of the files in one operation with the following pattern, where the <code>*</code> wildcard matches any characters within a single directory level (you can use <code>**</code> to match across multiple directory levels).</p>
<pre><code class="language-python">metrics = pl.read_parquet(PARQUET_FOLDER / "year=*" / "*.parquet")
</code></pre>
<pre><code class="language-python"># Display 5 randomly sampled rows from the dataframe
metrics.select(["country_code", "country_name", "region", "year", "WB_WDI_SP_POP_TOTL"]).sample(5)
</code></pre>
<table>
<thead>
<tr>
<th>country_code</th>
<th>country_name</th>
<th>region</th>
<th>year</th>
<th>WB_WDI_SP_POP_TOTL</th>
</tr>
</thead>
<tbody>
<tr>
<td>str</td>
<td>str</td>
<td>str</td>
<td>i64</td>
<td>f64</td>
</tr>
<tr>
<td>CRI</td>
<td>Costa Rica</td>
<td>Latin America &amp; Caribbean</td>
<td>1974</td>
<td>2.039643e6</td>
</tr>
<tr>
<td>NGA</td>
<td>Nigeria</td>
<td>Sub-Saharan Africa</td>
<td>1985</td>
<td>8.4897973e7</td>
</tr>
<tr>
<td>IDN</td>
<td>Indonesia</td>
<td>East Asia &amp; Pacific</td>
<td>2019</td>
<td>2.72489381e8</td>
</tr>
<tr>
<td>CHN</td>
<td>China</td>
<td>East Asia &amp; Pacific</td>
<td>2018</td>
<td>1.4028e9</td>
</tr>
<tr>
<td>ALB</td>
<td>Albania</td>
<td>Europe &amp; Central Asia</td>
<td>1975</td>
<td>2.404831e6</td>
</tr>
</tbody>
</table>
<h3 id="loading-multiple-json-files">Loading multiple JSON files</h3>
<p>Polars supports two primary JSON formats: newline-delimited JSON (NDJSON) and standard JSON arrays. NDJSON, where each line is a separate JSON object, is the more performant option for large datasets because it can be processed in a streaming fashion without loading the entire file into memory.</p>
<p>As with CSV, Polars infers the schema by sampling the data. The same principle applies: once the schema is established, it's enforced strictly throughout the query. For nested JSON structures, Polars maps these to its native Struct and List types, which are properly typed and benefit from Polars' vectorised execution.</p>
<p>That said, JSON is inherently less efficient than columnar formats for analytical workloads. It's row-oriented, text-based, and lacks the embedded statistics that enable predicate pushdown. If you're regularly processing JSON at scale, it's often worth converting to Parquet as a one-time transformation - you'll recoup the conversion cost quickly through faster subsequent reads.</p>
<p>In this example we are using the globbing functionality to load multiple JSON files simulating the type of raw data that is often generated from web APIs.</p>
<pre><code class="language-python">json_metrics = (
    pl.read_ndjson(JSON_FOLDER / "*_data.json")
)
</code></pre>
<p>When we display the first 2 rows of the JSON data below, you can see that it has loaded two of the columns as custom datatypes. Polars has scanned the first N elements of the JSON and determined that the <code>country_info</code> column contains a <code>struct</code> datatype with 6 elements. The <code>indicators</code> column contains a <code>list[struct]</code>: an array of structs, each with 4 elements.</p>
<p>This works really well if your JSON data is consistent in structure and content. Relying on inference for JSON data whose structure varies beyond the first N rows can be problematic; such data will likely need a manually defined schema to load reliably.</p>
<pre><code class="language-python"># Display the first 2 rows of the dataframe.
json_metrics.head(2)
</code></pre>
<table>
<thead>
<tr>
<th>country_code</th>
<th>country_info</th>
<th>indicators</th>
</tr>
</thead>
<tbody>
<tr>
<td>str</td>
<td>struct[6]</td>
<td>list[struct[4]]</td>
</tr>
<tr>
<td>ALB</td>
<td>{"Albania","Europe &amp; Central Asia","Upper middle income","Tirane","41.3317","19.8172"}</td>
<td>[{"WB_WDI_NY_GDP_PCAP_CD","GDP per capita (current US$)","Weighted average",[{2019,6069.439031}, {2004,2446.909499}, … {1986,693.873475}]}, {"WB_WDI_GC_DOD_TOTL_GD_ZS","Central government debt, total (% of GDP)","Weighted average",[{2019,74.808252}, {2017,74.523341}, … {1995,29.450991}]}, … {"WB_WDI_SP_DYN_LE00_IN","Life expectancy at birth, total (years)","Weighted average",[{2019,79.467}, {2004,75.951}, … {1973,67.107}]}]</td>
</tr>
<tr>
<td>ARG</td>
<td>{"Argentina","Latin America &amp; Caribbean","Upper middle income","Buenos Aires","-34.6118","-58.4173"}</td>
<td>[{"WB_WDI_SP_DYN_LE00_IN","Life expectancy at birth, total (years)","Weighted average",[{2019,76.847}, {2004,74.871}, … {1970,65.647}]}, {"WB_WDI_SP_POP_TOTL","Population, total","Sum",[{2004,3.8815916e7}, {2013,4.2582455e7}, … {1970,2.3878327e7}]}, … {"WB_WDI_NY_GDP_MKTP_CD","GDP (current US$)","Gap-filled total",[{2019,4.4775e11}, {2004,1.6466e11}, … {1970,3.1584e10}]}]</td>
</tr>
</tbody>
</table>
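<p>As noted above, when the structure of the JSON varies beyond the rows Polars samples, you can supply the schema explicitly rather than relying on inference. A minimal sketch (the file name and fields here are illustrative, not the full World Bank schema):</p>
<pre><code class="language-python"># Provide an explicit schema so loading does not depend on inference.
country_schema = {
    "country_code": pl.String,
    "country_info": pl.Struct({"country_name": pl.String, "region": pl.String}),
}
records = pl.read_ndjson("countries.json", schema=country_schema)
</code></pre>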
<h3 id="loading-from-duckdb">Loading from DuckDB</h3>
<p><a href="https://endjin.com/blog/duckdb-rise-of-in-process-analytics-understanding-data-singularity">DuckDB</a> and Polars are a powerful combination for data analysis. DuckDB is an in-process SQL OLAP database management system, which means it runs inside the same process as the application. This makes it incredibly fast for analytical queries.</p>
<p>The <code>.pl()</code> method provides a seamless way to convert the result of a DuckDB query into a Polars DataFrame, allowing you to move data efficiently between the two tools.</p>
<p>In the example below, we are doing a simple <code>SELECT * FROM</code> query to select all of the data from a single table. In the more detailed examples below, we show how to combine the power of DuckDB and Polars for more advanced analytics.</p>
<pre><code class="language-python">import duckdb

with duckdb.connect(f'{DUCKDB_PATH}', read_only=True) as duckdb_connection:
    duck_data = duckdb_connection.sql('SELECT * FROM data').pl()

duck_data.glimpse()
</code></pre>
<pre><code class="language-plaintext">Rows: 9715
Columns: 4
$ country_code   &lt;str&gt; 'VNM', 'VNM', 'GBR', 'CAN', 'CHL', 'CHN', 'VNM', 'GBR', 'USA', 'VNM'
$ year           &lt;i64&gt; 2011, 2018, 2016, 2010, 2010, 2010, 1999, 2009, 2009, 2009
$ indicator_code &lt;str&gt; 'WB_WDI_EG_USE_PCAP_KG_OE', 'WB_WDI_EG_USE_PCAP_KG_OE', 'WB_WDI_EG_USE_PCAP_KG_OE', 'WB_WDI_EG_USE_PCAP_KG_OE', 'WB_WDI_EG_USE_PCAP_KG_OE', 'WB_WDI_EG_USE_PCAP_KG_OE', 'WB_WDI_EG_USE_PCAP_KG_OE', 'WB_WDI_EG_USE_PCAP_KG_OE', 'WB_WDI_EG_USE_PCAP_KG_OE', 'WB_WDI_EG_USE_PCAP_KG_OE'
$ value          &lt;f64&gt; 663.57444, 885.646842, 2696.97999, 7660.754521, 1795.68173, 1899.799961, 358.075426, 3144.002013, 7053.358467, 615.142147
</code></pre>
<h2 id="data-exploration">Data Exploration</h2>
<p>Once your data is loaded, Polars provides several methods to explore and understand it.</p>
<p>We have already shown a few examples above:</p>
<ul>
<li><a href="https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.head.html"><code>polars.DataFrame.head()</code></a> - display the first N rows of the dataframe.</li>
<li><a href="https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.tail.html"><code>polars.DataFrame.tail()</code></a> - display the last N rows of the dataframe.</li>
<li><a href="https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.glimpse.html"><code>polars.DataFrame.glimpse()</code></a> - shows the values of the first few rows of a dataframe, but formats the output differently from head and tail. Here, each line of the output corresponds to a single column, making it easier to inspect wider dataframes.</li>
<li><a href="https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.sample.html"><code>polars.DataFrame.sample()</code></a> - samples n random rows from a DataFrame.</li>
<li><a href="https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.columns.html"><code>polars.DataFrame.columns</code></a> - provides a list of the columns in the dataframe.</li>
<li><a href="https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.describe.html"><code>polars.DataFrame.describe()</code></a> - generates summary statistics for the dataframe.</li>
</ul>
<pre><code class="language-python"># Display the summary statistics for the dataframe.
countries.describe()
</code></pre>
<table>
<thead>
<tr>
<th>statistic</th>
<th>country_code</th>
<th>iso2_code</th>
<th>country_name</th>
<th>region</th>
<th>region_id</th>
<th>income_level</th>
<th>capital_city</th>
<th>longitude</th>
<th>latitude</th>
</tr>
</thead>
<tbody>
<tr>
<td>str</td>
<td>str</td>
<td>str</td>
<td>str</td>
<td>str</td>
<td>str</td>
<td>str</td>
<td>str</td>
<td>f64</td>
<td>f64</td>
</tr>
<tr>
<td>count</td>
<td>296</td>
<td>296</td>
<td>296</td>
<td>296</td>
<td>296</td>
<td>296</td>
<td>211</td>
<td>211.0</td>
<td>211.0</td>
</tr>
<tr>
<td>null_count</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>85</td>
<td>85.0</td>
<td>85.0</td>
</tr>
<tr>
<td>mean</td>
<td>null</td>
<td>null</td>
<td>null</td>
<td>null</td>
<td>null</td>
<td>null</td>
<td>null</td>
<td>19.139549</td>
<td>18.889009</td>
</tr>
<tr>
<td>std</td>
<td>null</td>
<td>null</td>
<td>null</td>
<td>null</td>
<td>null</td>
<td>null</td>
<td>null</td>
<td>70.391069</td>
<td>24.210877</td>
</tr>
<tr>
<td>min</td>
<td>ABW</td>
<td>1A</td>
<td>Afghanistan</td>
<td>Aggregates</td>
<td>EAS</td>
<td>Aggregates</td>
<td>Abu Dhabi</td>
<td>-175.216</td>
<td>-41.2865</td>
</tr>
<tr>
<td>25%</td>
<td>null</td>
<td>null</td>
<td>null</td>
<td>null</td>
<td>null</td>
<td>null</td>
<td>null</td>
<td>-13.7</td>
<td>4.60987</td>
</tr>
<tr>
<td>50%</td>
<td>null</td>
<td>null</td>
<td>null</td>
<td>null</td>
<td>null</td>
<td>null</td>
<td>null</td>
<td>19.2595</td>
<td>17.3</td>
</tr>
<tr>
<td>75%</td>
<td>null</td>
<td>null</td>
<td>null</td>
<td>null</td>
<td>null</td>
<td>null</td>
<td>null</td>
<td>50.5354</td>
<td>40.0495</td>
</tr>
<tr>
<td>max</td>
<td>ZWE</td>
<td>ZW</td>
<td>Zimbabwe</td>
<td>Sub-Saharan Africa</td>
<td>SSF</td>
<td>Upper middle income</td>
<td>Zagreb</td>
<td>179.089567</td>
<td>64.1836</td>
</tr>
</tbody>
</table>
<h2 id="polars-expressions">Polars Expressions</h2>
<p>So far, we've covered the basics, but where Polars starts to differentiate itself is through its composable, expressive API, which adopts a functional programming approach - a good fit for data processing use cases.</p>
<p>The Polars team have worked hard to make the domain specific language (DSL) consistent and therefore intuitive to use.</p>
<p>The foundation of the language is the concept of "expressions": composable building blocks that each specialise in a specific data wrangling task such as:</p>
<ul>
<li>Filtering - filtering to a specific subset of data based on the values in specific columns.</li>
<li>Aggregating, summarising - often applied when grouping up data based on categorical values.</li>
<li>Selecting - trimming the dataframe down to specific columns that you want to analyse or display.</li>
<li>Transforming - pivoting, unpivoting and other operations to unpack complex structures such as arrays and dictionaries.</li>
<li>Joining - joining dataframes based on a relationship and specifying a join strategy (e.g. an "inner join" or "left outer join").</li>
<li>Calculated columns - adding new columns which are derived from others in the dataframe.</li>
<li>Cleaning - a diverse range of expressions are available to help clean up data, some examples include:
<ul>
<li>Handling empty data - dropping nulls or using strategies such as forward filling to fill in data where it is missing.</li>
<li>Dropping duplicates.</li>
<li>Adding unique IDs.</li>
</ul>
</li>
</ul>
<p>There are more specialised expressions which are generally organised under a specific namespace. For example <code>polars.Expr.str</code> is the namespace under which string based expressions are organised.</p>
<p>Polars expressions are functional abstractions over a Series, where a Series is an array of values with the same data type, e.g. <code>List[polars.Int64]</code>. They are often the contents of a specific column in your Polars dataframe, but they can also be created through other means (e.g. as a derived, intermediate result in a chain of expressions).</p>
<p>Each expression is elegantly simple: it takes a Series as input and produces a Series as output. Because the input and output types are the same, expressions can be chained indefinitely, making them composable.</p>
<p>Ritchie Vink draws a compelling analogy: "just as Python's vocabulary is small (if, else, for, lists) yet can express anything through combination, Polars gives you a limited set of operations that combine to handle use cases the developers never anticipated. You learn a small API surface, then apply that knowledge everywhere."</p>
<p>With composable expressions, you stay within Polars' DSL. The engine can analyse, optimise, and parallelise your logic because it understands what you're doing.</p>
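<p>As a small illustration of that composability, here is a single chained expression applied to the <code>metrics</code> dataframe we loaded earlier (the derived column name is our own):</p>
<pre><code class="language-python"># Each step takes an expression and returns an expression, so the whole
# transformation reads as one pipeline that Polars can optimise as a unit.
log_gdp = metrics.select(
    pl.col("WB_WDI_NY_GDP_PCAP_CD")
    .fill_null(strategy="mean")      # impute missing values
    .log10()                         # compress the scale
    .round(2)                        # tidy for display
    .alias("log10_gdp_per_capita")   # name the derived column
)
</code></pre>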
<p>In the examples below, we aim to show the flexibility and power of Polars expressions by applying them to the World Bank data based on common data wrangling scenarios.</p>
<h3 id="filtering-and-selecting">Filtering and selecting</h3>
<p>We've been asked to prepare an official list of countries to be used across our organisation as a "source of truth" across different types of analytics. This gives us an opportunity to show some of the most frequently used Polars functions: <code>filter</code>, <code>select</code> and <code>sort</code>.</p>
<p>The World Bank is deemed to be the source, but we spot that the raw countries data contains "aggregate" level results on top of data for individual countries. Therefore, we use <code>polars.DataFrame.filter</code> to filter the dataframe so that only the country level data is retained, using the <code>~</code> operator to negate the <code>.is_in</code> condition.</p>
<p>We then use <a href="https://docs.pola.rs/api/python/stable/reference/lazyframe/api/polars.LazyFrame.select.html"><code>polars.DataFrame.select</code></a> to select the columns we want to publish from the range of columns available in the source dataframe.</p>
<p>Finally, we use the <a href="https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.sort.html"><code>polars.DataFrame.sort</code></a> expression so it is ordered by country name.</p>
<p>We chain these operations together in a pattern commonly seen in functional programming. This enables Polars to see the end-to-end intent of the operation and optimise it accordingly. It also produces code which is easy to read and maintain.</p>
<p>Polars works well with modern visualisation libraries such as Plotly, so we can bring our final dataset to life by projecting it onto a map of the world.</p>
<pre><code class="language-python"># Inspection of the "region" column to see unique values shows an "Aggregates" category which we want to filter out.
countries["region"].unique()
</code></pre>
<table>
<thead>
<tr>
<th>region</th>
</tr>
</thead>
<tbody>
<tr>
<td>str</td>
</tr>
<tr>
<td>Sub-Saharan Africa</td>
</tr>
<tr>
<td>Latin America &amp; Caribbean</td>
</tr>
<tr>
<td>South Asia</td>
</tr>
<tr>
<td>East Asia &amp; Pacific</td>
</tr>
<tr>
<td>North America</td>
</tr>
<tr>
<td>Aggregates</td>
</tr>
<tr>
<td>Europe &amp; Central Asia</td>
</tr>
<tr>
<td>Middle East, North Africa, Afg…</td>
</tr>
</tbody>
</table>
<pre><code class="language-python"># We don't want to publish all of the columns in the countries dataframe.
countries.columns
</code></pre>
<pre><code class="language-plaintext">    ['country_code',
     'iso2_code',
     'country_name',
     'region',
     'region_id',
     'income_level',
     'capital_city',
     'longitude',
     'latitude']
</code></pre>
<pre><code class="language-python">countries = (
    countries
    .filter(~pl.col("region").is_in(["Aggregates"]))  # Filter the data, using the ~ operator to negate the is_in condition so we exclude "Aggregates".
    .select(["country_code", "iso2_code", "country_name", "region", "capital_city", "longitude", "latitude"])  # Select only relevant columns.
    .sort(["country_name"])  # Sort by country name.
)
countries
</code></pre>
<table>
<thead>
<tr>
<th>country_code</th>
<th>iso2_code</th>
<th>country_name</th>
<th>region</th>
<th>capital_city</th>
<th>longitude</th>
<th>latitude</th>
</tr>
</thead>
<tbody>
<tr>
<td>str</td>
<td>str</td>
<td>str</td>
<td>str</td>
<td>str</td>
<td>f64</td>
<td>f64</td>
</tr>
<tr>
<td>AFG</td>
<td>AF</td>
<td>Afghanistan</td>
<td>Middle East, North Africa, Afg…</td>
<td>Kabul</td>
<td>69.1761</td>
<td>34.5228</td>
</tr>
<tr>
<td>ALB</td>
<td>AL</td>
<td>Albania</td>
<td>Europe &amp; Central Asia</td>
<td>Tirane</td>
<td>19.8172</td>
<td>41.3317</td>
</tr>
<tr>
<td>DZA</td>
<td>DZ</td>
<td>Algeria</td>
<td>Middle East, North Africa, Afg…</td>
<td>Algiers</td>
<td>3.05097</td>
<td>36.7397</td>
</tr>
<tr>
<td>ASM</td>
<td>AS</td>
<td>American Samoa</td>
<td>East Asia &amp; Pacific</td>
<td>Pago Pago</td>
<td>-170.691</td>
<td>-14.2846</td>
</tr>
<tr>
<td>AND</td>
<td>AD</td>
<td>Andorra</td>
<td>Europe &amp; Central Asia</td>
<td>Andorra la Vella</td>
<td>1.5218</td>
<td>42.5075</td>
</tr>
<tr>
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
</tr>
<tr>
<td>VIR</td>
<td>VI</td>
<td>Virgin Islands (U.S.)</td>
<td>Latin America &amp; Caribbean</td>
<td>Charlotte Amalie</td>
<td>-64.8963</td>
<td>18.3358</td>
</tr>
<tr>
<td>PSE</td>
<td>PS</td>
<td>West Bank and Gaza</td>
<td>Middle East, North Africa, Afg…</td>
<td>null</td>
<td>null</td>
<td>null</td>
</tr>
<tr>
<td>YEM</td>
<td>YE</td>
<td>Yemen, Rep.</td>
<td>Middle East, North Africa, Afg…</td>
<td>Sana'a</td>
<td>44.2075</td>
<td>15.352</td>
</tr>
<tr>
<td>ZMB</td>
<td>ZM</td>
<td>Zambia</td>
<td>Sub-Saharan Africa</td>
<td>Lusaka</td>
<td>28.2937</td>
<td>-15.3982</td>
</tr>
<tr>
<td>ZWE</td>
<td>ZW</td>
<td>Zimbabwe</td>
<td>Sub-Saharan Africa</td>
<td>Harare</td>
<td>31.0672</td>
<td>-17.8312</td>
</tr>
</tbody>
</table>
<pre><code class="language-python"># Display the results on a geographical scatter plot.
fig = px.scatter_geo(
    countries,
    lat="latitude",
    lon="longitude",
    hover_name="country_name",
    color="region",           # Color points by region
    projection="natural earth",
    title="World Bank Data - Country Locations"
)

fig.show()
</code></pre>
<p><a target="_blank" href="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/01/chart_1.png"><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/01/chart_1.png" alt="Plotly chart showing countries on a map of the world" title="Plotly chart showing countries on a map of the world" srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2026/01/chart_1.png 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2026/01/chart_1.png 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2026/01/chart_1.png 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2026/01/chart_1.png 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></a></p>
<h3 id="calculated-columns">Calculated columns</h3>
<p>In the next example, we want to answer a simple question "How has the population (in millions) of the United Kingdom grown year on year over the last 50 years?".</p>
<p>In the dataset we loaded from Parquet, we have the population data in a column called <code>WB_WDI_SP_POP_TOTL</code>.</p>
<p>We need to filter this data in two dimensions:</p>
<ul>
<li>We only want data for the "United Kingdom"</li>
<li>We only want data for the last 50 years.</li>
</ul>
<p>We use <a href="https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.with_columns.html"><code>polars.DataFrame.with_columns</code></a> to add a new calculated column which converts the results into millions (making it easier for humans to reason with the data).</p>
<p>Next we need to sort the data in ascending order by year so we can then use the <a href="https://docs.pola.rs/api/python/stable/reference/series/api/polars.Series.shift.html"><code>polars.Series.shift</code></a> expression to enable the percentage change year on year to be calculated as a new column.</p>
<p>We then use the <a href="https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.when.html"><code>polars.when</code></a> function to add a final new column, "color", which is set to a literal value of "green" when there is positive population growth, and "red" otherwise.</p>
<pre><code class="language-python">NUMBER_OF_YEARS = 50
</code></pre>
<pre><code class="language-python">uk_population = (
    metrics
    .filter((pl.col("country_name") == "United Kingdom") &amp; (pl.col("year") &gt; (int(metrics["year"].max()) - NUMBER_OF_YEARS)))  # Filter in one step based on country name and year.
    .with_columns((pl.col("WB_WDI_SP_POP_TOTL") / 1000000).alias("population_in_millions"))  # Add a new column for population in millions.
    .sort("year", descending=False)
    .with_columns(
        (((pl.col("population_in_millions") - pl.col("population_in_millions").shift(1))) / pl.col("population_in_millions").shift(1) * 100)
        .alias("population_change_percentage")  # Add new column for population change percentage.
    )
    .with_columns(
        pl.when(pl.col("population_change_percentage") &gt; 0)
        .then(pl.lit("green"))
        .otherwise(pl.lit("red"))
        .alias("color")  # Add new column for color based on population change.
    )
    .select(["year", "population_in_millions", "population_change_percentage", "color"])
)
</code></pre>
<pre><code class="language-python">uk_population
</code></pre>
<table>
<thead>
<tr>
<th>year</th>
<th>population_in_millions</th>
<th>population_change_percentage</th>
<th>color</th>
</tr>
</thead>
<tbody>
<tr>
<td>i64</td>
<td>f64</td>
<td>f64</td>
<td>str</td>
</tr>
<tr>
<td>1975</td>
<td>56.2258</td>
<td>null</td>
<td>red</td>
</tr>
<tr>
<td>1976</td>
<td>56.211968</td>
<td>-0.024601</td>
<td>red</td>
</tr>
<tr>
<td>1977</td>
<td>56.193492</td>
<td>-0.032868</td>
<td>red</td>
</tr>
<tr>
<td>1978</td>
<td>56.196504</td>
<td>0.00536</td>
<td>green</td>
</tr>
<tr>
<td>1979</td>
<td>56.246951</td>
<td>0.089769</td>
<td>green</td>
</tr>
<tr>
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
</tr>
<tr>
<td>2020</td>
<td>66.744</td>
<td>0.169591</td>
<td>green</td>
</tr>
<tr>
<td>2021</td>
<td>66.984</td>
<td>0.359583</td>
<td>green</td>
</tr>
<tr>
<td>2022</td>
<td>67.604</td>
<td>0.925594</td>
<td>green</td>
</tr>
<tr>
<td>2023</td>
<td>68.492</td>
<td>1.313532</td>
<td>green</td>
</tr>
<tr>
<td>2024</td>
<td>69.226</td>
<td>1.071658</td>
<td>green</td>
</tr>
</tbody>
</table>
<pre><code class="language-python"># Create subplots with 2 rows and 1 column
fig = make_subplots(
    rows=2, cols=1,
    shared_xaxes=True,
    vertical_spacing=0.1,  # Space between charts
    row_heights=[0.7, 0.3],  # 70% height for main chart, 30% for bar chart
    subplot_titles=("Total Population (Millions)", "Year-on-Year Growth (%)")
)

# Top Chart: Absolute Population (Line + Markers)
fig.add_trace(
    go.Scatter(
        x=uk_population["year"],
        y=uk_population["population_in_millions"],
        mode="lines+markers",
        name="Population",
        line=dict(width=3)
    ),
    row=1, col=1
)

# Bottom Chart: Percentage Change (Bar)
fig.add_trace(
    go.Bar(
        x=uk_population["year"],
        y=uk_population["population_change_percentage"],
        marker_color=uk_population["color"],  # Use the calculated red/green column
        name="Change %"
    ),
    row=2, col=1
)

# Update layout configuration
fig.update_layout(
    title_text="United Kingdom Population Analysis (Last 50 Years)",
    showlegend=False
)

fig.show()
</code></pre>
<p><a target="_blank" href="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/01/chart_2.png"><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/01/chart_2.png" alt="Chart showing changes in UK population over the last 50 years" title="Chart showing changes in UK population over the last 50 years" srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2026/01/chart_2.png 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2026/01/chart_2.png 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2026/01/chart_2.png 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2026/01/chart_2.png 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></a></p>
<h3 id="unpacking-complex-types">Unpacking complex types</h3>
<p>In the data loading examples above, we loaded a set of JSON files (one per country) that contained metrics for each country as an array of complex type.</p>
<p>Polars provides two useful functions to enable you unpack this type of data:</p>
<ul>
<li><p>The <a href="https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.unnest.html"><code>polars.DataFrame.unnest</code></a> function is used to decompose struct columns, creating one new column for each of their fields. For example, the column <code>country_info</code> contains a dictionary-like structure with data such as <code>{"country_name": "Argentina", "region": "Latin America &amp; Caribbean ", "income_level": "Upper middle income", "capital_city": "Buenos Aires", "longitude": "-34.6118", "latitude": "-58.4173"}</code>. When we call the <code>unnest</code> operation on this column, it creates 6 new columns (<code>country_name</code>, <code>region</code>, <code>income_level</code>, <code>capital_city</code>, <code>longitude</code> and <code>latitude</code>) and populates these columns with the values in those respective elements of the dictionary-like structure.</p>
</li>
<li><p>The <a href="https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.explode.html"><code>polars.DataFrame.explode</code></a> method is used to unpack columns which contain an array (list) object, "exploding" the data by creating a row for each element in the array.</p>
</li>
</ul>
<p>The net result is we can flatten out the nested data into a tabular form, making it ready for downstream analytics.</p>
<pre><code class="language-python">json_metrics = (
    json_metrics
    .explode("indicators")  # Turn list of indicators into individual row for each indicator
    .unnest("indicators")  # Unpack the indicators object into individual components
    .explode("data_points")  # Explode the data_points (a list of {"year: XXXX, "value": XXXX}) into individual rows
    .unnest("data_points")  # Unpack the datapoints into separate columns for `year` and `value`
    .unnest("country_info")  # Unpack the country_info object into individual columns
)
json_metrics
</code></pre>
<table>
<thead>
<tr>
<th>country_code</th>
<th>country_name</th>
<th>region</th>
<th>income_level</th>
<th>capital_city</th>
<th>latitude</th>
<th>longitude</th>
<th>indicator_code</th>
<th>indicator_name</th>
<th>aggregation_method</th>
<th>year</th>
<th>value</th>
</tr>
</thead>
<tbody>
<tr>
<td>str</td>
<td>str</td>
<td>str</td>
<td>str</td>
<td>str</td>
<td>str</td>
<td>str</td>
<td>str</td>
<td>str</td>
<td>str</td>
<td>i64</td>
<td>f64</td>
</tr>
<tr>
<td>ALB</td>
<td>Albania</td>
<td>Europe &amp; Central Asia</td>
<td>Upper middle income</td>
<td>Tirane</td>
<td>41.3317</td>
<td>19.8172</td>
<td>WB_WDI_NY_GDP_PCAP_CD</td>
<td>GDP per capita (current US$)</td>
<td>Weighted average</td>
<td>2019</td>
<td>6069.439031</td>
</tr>
<tr>
<td>ALB</td>
<td>Albania</td>
<td>Europe &amp; Central Asia</td>
<td>Upper middle income</td>
<td>Tirane</td>
<td>41.3317</td>
<td>19.8172</td>
<td>WB_WDI_NY_GDP_PCAP_CD</td>
<td>GDP per capita (current US$)</td>
<td>Weighted average</td>
<td>2004</td>
<td>2446.909499</td>
</tr>
<tr>
<td>ALB</td>
<td>Albania</td>
<td>Europe &amp; Central Asia</td>
<td>Upper middle income</td>
<td>Tirane</td>
<td>41.3317</td>
<td>19.8172</td>
<td>WB_WDI_NY_GDP_PCAP_CD</td>
<td>GDP per capita (current US$)</td>
<td>Weighted average</td>
<td>2013</td>
<td>4542.929036</td>
</tr>
<tr>
<td>ALB</td>
<td>Albania</td>
<td>Europe &amp; Central Asia</td>
<td>Upper middle income</td>
<td>Tirane</td>
<td>41.3317</td>
<td>19.8172</td>
<td>WB_WDI_NY_GDP_PCAP_CD</td>
<td>GDP per capita (current US$)</td>
<td>Weighted average</td>
<td>2000</td>
<td>1160.420471</td>
</tr>
<tr>
<td>ALB</td>
<td>Albania</td>
<td>Europe &amp; Central Asia</td>
<td>Upper middle income</td>
<td>Tirane</td>
<td>41.3317</td>
<td>19.8172</td>
<td>WB_WDI_NY_GDP_PCAP_CD</td>
<td>GDP per capita (current US$)</td>
<td>Weighted average</td>
<td>2008</td>
<td>4498.504868</td>
</tr>
<tr>
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
</tr>
<tr>
<td>ZAF</td>
<td>South Africa</td>
<td>Sub-Saharan Africa</td>
<td>Upper middle income</td>
<td>Pretoria</td>
<td>-25.746</td>
<td>28.1871</td>
<td>WB_WDI_SL_UEM_TOTL_ZS</td>
<td>Unemployment, total (% of tota…</td>
<td>Weighted average</td>
<td>2000</td>
<td>22.714</td>
</tr>
<tr>
<td>ZAF</td>
<td>South Africa</td>
<td>Sub-Saharan Africa</td>
<td>Upper middle income</td>
<td>Pretoria</td>
<td>-25.746</td>
<td>28.1871</td>
<td>WB_WDI_SL_UEM_TOTL_ZS</td>
<td>Unemployment, total (% of tota…</td>
<td>Weighted average</td>
<td>1999</td>
<td>22.791</td>
</tr>
<tr>
<td>ZAF</td>
<td>South Africa</td>
<td>Sub-Saharan Africa</td>
<td>Upper middle income</td>
<td>Pretoria</td>
<td>-25.746</td>
<td>28.1871</td>
<td>WB_WDI_SL_UEM_TOTL_ZS</td>
<td>Unemployment, total (% of tota…</td>
<td>Weighted average</td>
<td>1995</td>
<td>22.647</td>
</tr>
<tr>
<td>ZAF</td>
<td>South Africa</td>
<td>Sub-Saharan Africa</td>
<td>Upper middle income</td>
<td>Pretoria</td>
<td>-25.746</td>
<td>28.1871</td>
<td>WB_WDI_SL_UEM_TOTL_ZS</td>
<td>Unemployment, total (% of tota…</td>
<td>Weighted average</td>
<td>1991</td>
<td>23.002</td>
</tr>
<tr>
<td>ZAF</td>
<td>South Africa</td>
<td>Sub-Saharan Africa</td>
<td>Upper middle income</td>
<td>Pretoria</td>
<td>-25.746</td>
<td>28.1871</td>
<td>WB_WDI_SL_UEM_TOTL_ZS</td>
<td>Unemployment, total (% of tota…</td>
<td>Weighted average</td>
<td>1996</td>
<td>22.48</td>
</tr>
</tbody>
</table>
<pre><code class="language-python">json_metrics["indicator_code"].unique()
</code></pre>
<table>
<thead>
<tr>
<th>indicator_code</th>
</tr>
</thead>
<tbody>
<tr>
<td>str</td>
</tr>
<tr>
<td>WB_WDI_SP_POP_TOTL</td>
</tr>
<tr>
<td>WB_WDI_GC_DOD_TOTL_GD_ZS</td>
</tr>
<tr>
<td>WB_WDI_EG_USE_PCAP_KG_OE</td>
</tr>
<tr>
<td>WB_WDI_SE_ADT_LITR_ZS</td>
</tr>
<tr>
<td>WB_WDI_NY_GDP_MKTP_CD</td>
</tr>
<tr>
<td>WB_WDI_NY_GDP_PCAP_CD</td>
</tr>
<tr>
<td>WB_WDI_SL_UEM_TOTL_ZS</td>
</tr>
<tr>
<td>WB_WDI_SP_DYN_LE00_IN</td>
</tr>
</tbody>
</table>
<h2 id="complex-transformation-leveraging-duckdb-and-polars">Complex transformation, leveraging DuckDB and Polars</h2>
<p>When you pair DuckDB with Polars, you get the best of both worlds:</p>
<ul>
<li><strong>High-Performance SQL:</strong> Use DuckDB's fast SQL engine to perform initial filtering, aggregation, and data manipulation at the database level.</li>
<li><strong>Expressive DataFrame API:</strong> Load the results directly into a Polars DataFrame to leverage its powerful and expressive API for more complex transformations and analysis.</li>
</ul>
<p>In this example, we are first going to use DuckDB to prepare the data through a more complex query which joins two tables and filters the data with a <code>WHERE</code> clause.</p>
<pre><code class="language-python">with duckdb.connect(f'{DUCKDB_PATH}', read_only=True) as duckdb_connection:
    
    duckdb_query_results = duckdb_connection.sql(
        """
        SELECT d.country_code, c.country_name, c.region, d.year, d.indicator_code, d.value 
        FROM data d
        JOIN countries c ON d.country_code = c.country_code
        WHERE d.indicator_code IN ('WB_WDI_SP_POP_TOTL', 'WB_WDI_SP_DYN_LE00_IN', 'WB_WDI_NY_GDP_PCAP_CD')
        """
        ).pl()

duckdb_query_results.head(3)
</code></pre>
<table>
<thead>
<tr>
<th>country_code</th>
<th>country_name</th>
<th>region</th>
<th>year</th>
<th>indicator_code</th>
<th>value</th>
</tr>
</thead>
<tbody>
<tr>
<td>str</td>
<td>str</td>
<td>str</td>
<td>i64</td>
<td>str</td>
<td>f64</td>
</tr>
<tr>
<td>VNM</td>
<td>Viet Nam</td>
<td>East Asia &amp; Pacific</td>
<td>2011</td>
<td>WB_WDI_NY_GDP_PCAP_CD</td>
<td>1950.925042</td>
</tr>
<tr>
<td>VNM</td>
<td>Viet Nam</td>
<td>East Asia &amp; Pacific</td>
<td>2018</td>
<td>WB_WDI_NY_GDP_PCAP_CD</td>
<td>3222.310031</td>
</tr>
<tr>
<td>GBR</td>
<td>United Kingdom</td>
<td>Europe &amp; Central Asia</td>
<td>2016</td>
<td>WB_WDI_NY_GDP_PCAP_CD</td>
<td>41257.908555</td>
</tr>
</tbody>
</table>
<p>Next we are going to use a <a href="https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.pivot.html"><code>polars.DataFrame.pivot</code></a> operation to transform the shape of the dataframe and get the data ready to plot on a chart.</p>
<pre><code class="language-python">world_wealth_and_health = (
    duckdb_query_results
    .pivot(
        on=["indicator_code"],
        index=["country_code", "country_name", "region", "year"],
        values="value"
        )
    .rename(
        {
            "WB_WDI_NY_GDP_PCAP_CD": "gdp_usd_per_capita",
            "WB_WDI_SP_DYN_LE00_IN": "life_expectancy",
            "WB_WDI_SP_POP_TOTL": "population"
        }
        )
    .drop_nulls(subset=["gdp_usd_per_capita", "life_expectancy", "population"])  # Drop rows with nulls in any of the key metrics
    .with_columns(
        [
            (pl.col("population") / 1000000).round(2).alias("population_in_millions"),
        ]
    )
    .sort(["year", "country_code"])
)
</code></pre>
<pre><code class="language-python">world_wealth_and_health.head(3)
</code></pre>
<table>
<thead>
<tr>
<th>country_code</th>
<th>country_name</th>
<th>region</th>
<th>year</th>
<th>gdp_usd_per_capita</th>
<th>life_expectancy</th>
<th>population</th>
<th>population_in_millions</th>
</tr>
</thead>
<tbody>
<tr>
<td>str</td>
<td>str</td>
<td>str</td>
<td>i64</td>
<td>f64</td>
<td>f64</td>
<td>f64</td>
<td>f64</td>
</tr>
<tr>
<td>ARG</td>
<td>Argentina</td>
<td>Latin America &amp; Caribbean</td>
<td>1970</td>
<td>1322.714542</td>
<td>65.647</td>
<td>2.3878327e7</td>
<td>23.88</td>
</tr>
<tr>
<td>AUS</td>
<td>Australia</td>
<td>East Asia &amp; Pacific</td>
<td>1970</td>
<td>3309.763063</td>
<td>71.018537</td>
<td>1.2507e7</td>
<td>12.51</td>
</tr>
<tr>
<td>BGD</td>
<td>Bangladesh</td>
<td>South Asia</td>
<td>1970</td>
<td>130.218161</td>
<td>42.667</td>
<td>6.9058894e7</td>
<td>69.06</td>
</tr>
</tbody>
</table>
<p>Finally, we chart the results to show the snail trail of each country over time on a two-dimensional scatter chart.</p>
<pre><code class="language-python">fig = px.scatter(
    world_wealth_and_health,
    x="gdp_usd_per_capita",
    y="life_expectancy",
    animation_frame="year",
    animation_group="country_name",
    size="population_in_millions",
    color="region",
    hover_name="country_name",
    log_x=True,
    size_max=55,
    range_x=[100, 100000],
    range_y=[25, 90],
    title="World Wealth and Health Over Time",
    labels={
        "gdp_usd_per_capita": "Wealth (GDP Per Capita in USD)",
        "life_expectancy": "Health (Life Expectancy in Years)"
    }
)
fig.show()
</code></pre>
<p><a target="_blank" href="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/01/chart_3.png"><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/01/chart_3.png" alt="Chart showing scatter plot or individual countries based on wealth (GDP, log scale, x-axis) versus wealth (lifeexpectancy, y-axis)" title="Chart showing scatter plot or individual countries based on wealth (GDP, log scale, x-axis) versus wealth (lifeexpectancy, y-axis)" srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2026/01/chart_3.png 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2026/01/chart_3.png 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2026/01/chart_3.png 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2026/01/chart_3.png 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></a></p>
<h2 id="lazy-loading">Lazy loading</h2>
<pre><code class="language-python"># Read CSV using lazy frame
data = pl.scan_parquet(PARQUET_FOLDER, hive_partitioning=True)
</code></pre>
<pre><code class="language-python"># The `scan_parquet returns a lazy frame not the data,  But it does inspect the files.
data
</code></pre>
<p><i>naive plan: (run <b>LazyFrame.explain(optimized=True)</b> to see the optimized plan)</i>
</p><p></p>
<div>Parquet SCAN [../data/world_bank/parquet/year=1970/af92515fa293459c9aa99c6433ee2cda.parquet, ... 54 other sources]<p></p>PROJECT 11/12 COLUMNS<p></p>ESTIMATED ROWS: 1760</div><p></p>
<pre><code class="language-python"># We can inspect the schema without loading data
data.collect_schema()
</code></pre>
<pre><code>Schema([('country_code', String),
        ('year', Int64),
        ('WB_WDI_EG_USE_PCAP_KG_OE', Float64),
        ('WB_WDI_GC_DOD_TOTL_GD_ZS', Float64),
        ('WB_WDI_NY_GDP_MKTP_CD', Float64),
        ('WB_WDI_NY_GDP_PCAP_CD', Float64),
        ('WB_WDI_SE_ADT_LITR_ZS', Float64),
        ('WB_WDI_SL_UEM_TOTL_ZS', Float64),
        ('WB_WDI_SP_DYN_LE00_IN', Float64),
        ('WB_WDI_SP_POP_TOTL', Float64),
        ('country_name', String),
        ('region', String)])
</code></pre>
<pre><code class="language-python"># Start to build up some operations on the lazy frame
data = (
    data
    .rename({"WB_WDI_SP_POP_TOTL": "population"})  # Rename the value column to population
)
</code></pre>
<pre><code class="language-python"># We want use 1980 as the base year for our analysis
data = data.filter(pl.col("year") &gt;= 1980)
</code></pre>
<pre><code class="language-python"># Normalise each country's population so it has a maximum of 1 for all years since 1980
data = (
    data
    .with_columns(
        pl.col("population")
        .max()
        .over("country_code")
        .alias("max_population")  # Get max population per country
    )
    .with_columns(
        (pl.col("population") / pl.col("max_population")).alias("normalized_population")  # Normalized population
    )
    .drop("max_population")  # Drop the intermediate column
)   
</code></pre>
<pre><code class="language-python"># Select the final set of columns we want to publish
data = data.select([
    "country_code",
    "country_name",
    "region",
    "year",
    "population",
    "normalized_population",
])
</code></pre>
<pre><code class="language-python"># We only want to see data for a selection of countries and from 1980 onwards
data = (
    data
    .filter(pl.col("country_code").is_in(["CHN", "UK", "ALB", "JPN", "NZL", "CAN"]))  # Filter for specific countries and years
)
</code></pre>
<p>At this stage we haven't actually executed any of the steps we've built up above - we've only constructed a query plan. We can compare the query plans generated by <code>explain(optimized=False)</code> and <code>explain(optimized=True)</code> - each prints a nested set of steps, with the scan at the bottom and the final operation at the top. A quick scan shows the following key differences:</p>
<p>The un-optimized plan simply contains every step, in the order we defined them above:</p>
<pre><code class="language-python"># The un-optimized plan shows all the steps we have built up above.
print(data.explain(optimized=False))
</code></pre>
<pre><code class="language-plaintext">FILTER col("country_code").is_in([["CHN", "UK", … "CAN"]])
FROM
  SELECT [col("country_code"), col("country_name"), col("region"), col("year"), col("population"), col("normalized_population")]
    SELECT [col("country_code"), col("year"), col("WB_WDI_EG_USE_PCAP_KG_OE"), col("WB_WDI_GC_DOD_TOTL_GD_ZS"), col("WB_WDI_NY_GDP_MKTP_CD"), col("WB_WDI_NY_GDP_PCAP_CD"), col("WB_WDI_SE_ADT_LITR_ZS"), col("WB_WDI_SL_UEM_TOTL_ZS"), col("WB_WDI_SP_DYN_LE00_IN"), col("population"), col("country_name"), col("region"), col("normalized_population")]
        WITH_COLUMNS:
        [[(col("population")) / (col("max_population"))].alias("normalized_population")] 
          WITH_COLUMNS:
          [col("population").max().over([col("country_code")]).alias("max_population")] 
          FILTER [(col("year")) &gt;= (1980)]
          FROM
            SELECT [col("country_code"), col("year"), col("WB_WDI_EG_USE_PCAP_KG_OE"), col("WB_WDI_GC_DOD_TOTL_GD_ZS"), col("WB_WDI_NY_GDP_MKTP_CD"), col("WB_WDI_NY_GDP_PCAP_CD"), col("WB_WDI_SE_ADT_LITR_ZS"), col("WB_WDI_SL_UEM_TOTL_ZS"), col("WB_WDI_SP_DYN_LE00_IN"), col("WB_WDI_SP_POP_TOTL").alias("population"), col("country_name"), col("region")]
              Parquet SCAN [../data/world_bank/parquet/year=1970/af92515fa293459c9aa99c6433ee2cda.parquet, ... 54 other sources]
              PROJECT 11/12 COLUMNS
              ESTIMATED ROWS: 1760
</code></pre>
<p>The optimized plan shows fewer steps and a different ordering of operations because Polars has applied multiple levels of optimization:</p>
<ul>
<li><strong>Predicate Pushdown</strong>: Look for <code>SELECTION</code> in the scan node. Polars has pushed the filter logic down to the data access layer. Instead of reading all rows into memory and <em>then</em> filtering them, it applies the filter <em>during</em> the scan, discarding non-matching rows immediately.</li>
<li><strong>Projection Pushdown</strong>: In the optimized plan, look for <code>PROJECT 5/12 COLUMNS</code> in the scan node. Polars has analysed the query and determined exactly which columns are needed, and will read only those specific columns from disk, ignoring the rest to save memory and I/O.</li>
<li><strong>Intelligent Scan Optimisation</strong>: This combines <strong>Partition Pruning</strong> and <strong>Parquet Statistics</strong>.
<ul>
<li><strong>Partition Pruning</strong>: Because we used <code>hive_partitioning=True</code> and filtered on <code>year</code>, Polars checks the folder names first. It completely skips opening the files for years 1970-1979, only reading files that match the filter.</li>
<li><strong>Row Group Statistics</strong>: Unique to Parquet (vs CSV/JSON), these files contain metadata with min/max values for chunks of data ("Row Groups"). If we filtered on a data column, Polars would check these stats and skip reading entire chunks of the file if they couldn't possibly contain matching data.</li>
</ul>
</li>
</ul>
<pre><code class="language-python"># The optimized plan shows how Polars will execute the query efficiently.
print(data.explain())
</code></pre>
<pre><code class="language-plaintext">simple π 6/6 ["country_code", "country_name", ... 4 other columns]
    WITH_COLUMNS:
    [[(col("population")) / (col("max_population"))].alias("normalized_population")] 
      WITH_COLUMNS:
      [col("population").max().over([col("country_code")]).alias("max_population")] 
      SELECT [col("country_code"), col("year"), col("WB_WDI_SP_POP_TOTL").alias("population"), col("country_name"), col("region")]
        Parquet SCAN [../data/world_bank/parquet/year=1980/77399e87ba29443cb65b5a3fff361036.parquet, ... 44 other sources]
        PROJECT 5/12 COLUMNS
        SELECTION: [([(col("year")) &gt;= (1980)]) &amp; (col("country_code").is_in([["CHN", "UK", … "CAN"]]))]
        ESTIMATED ROWS: 1440
</code></pre>
<p>Now we can run the end-to-end logic - ingesting the data and performing the optimised chain of operations - by calling the <a href="https://docs.pola.rs/api/python/stable/reference/lazyframe/api/polars.LazyFrame.collect.html"><code>LazyFrame.collect()</code></a> method.</p>
<pre><code class="language-python"># Get the data by executing `.collect()` on the lazy frame
data.collect()
</code></pre>
<table>
<thead>
<tr>
<th>country_code</th>
<th>country_name</th>
<th>region</th>
<th>year</th>
<th>population</th>
<th>normalized_population</th>
</tr>
</thead>
<tbody>
<tr>
<td>str</td>
<td>str</td>
<td>str</td>
<td>i64</td>
<td>f64</td>
<td>f64</td>
</tr>
<tr>
<td>NZL</td>
<td>New Zealand</td>
<td>East Asia &amp; Pacific</td>
<td>1980</td>
<td>3.1129e6</td>
<td>0.588728</td>
</tr>
<tr>
<td>JPN</td>
<td>Japan</td>
<td>East Asia &amp; Pacific</td>
<td>1980</td>
<td>1.16807e8</td>
<td>0.912056</td>
</tr>
<tr>
<td>CAN</td>
<td>Canada</td>
<td>North America</td>
<td>1980</td>
<td>2.4515667e7</td>
<td>0.593764</td>
</tr>
<tr>
<td>CHN</td>
<td>China</td>
<td>East Asia &amp; Pacific</td>
<td>1980</td>
<td>9.81235e8</td>
<td>0.694749</td>
</tr>
<tr>
<td>ALB</td>
<td>Albania</td>
<td>Europe &amp; Central Asia</td>
<td>1980</td>
<td>2.671997e6</td>
<td>0.813012</td>
</tr>
<tr>
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
</tr>
<tr>
<td>CAN</td>
<td>Canada</td>
<td>North America</td>
<td>2024</td>
<td>4.1288599e7</td>
<td>1.0</td>
</tr>
<tr>
<td>JPN</td>
<td>Japan</td>
<td>East Asia &amp; Pacific</td>
<td>2024</td>
<td>1.23975371e8</td>
<td>0.968028</td>
</tr>
<tr>
<td>NZL</td>
<td>New Zealand</td>
<td>East Asia &amp; Pacific</td>
<td>2024</td>
<td>5.2875e6</td>
<td>1.0</td>
</tr>
<tr>
<td>ALB</td>
<td>Albania</td>
<td>Europe &amp; Central Asia</td>
<td>2024</td>
<td>2.377128e6</td>
<td>0.723292</td>
</tr>
<tr>
<td>CHN</td>
<td>China</td>
<td>East Asia &amp; Pacific</td>
<td>2024</td>
<td>1.4090e9</td>
<td>0.997603</td>
</tr>
</tbody>
</table>
<p>This may seem like a lot of additional steps for the relatively small amount of demo data we are using here. But as you scale up to many gigabytes of data spread across thousands of Parquet files in a lakehouse architecture, this approach delivers significant performance gains.</p>
<p>The final step is to bring our analysis to life with a chart.</p>
<pre><code class="language-python"># Plot the result one line per country showing normalized population over time
line_chart = px.line(
    data.collect(),
    x="year",
    y="normalized_population",
    color="country_name",
    title="Normalized Population Growth Since 1980",
    markers=True
)
line_chart.show()
</code></pre>
<h2 id="streaming-execution">Streaming Execution</h2>
<p>In the example above, we used <code>.collect()</code> to materialize the final result into memory. For the amount of data we are working with here, this is perfectly fine.</p>
<p>However, one of Polars' most powerful features is its <strong>Streaming Engine</strong>.</p>
<p>If your dataset is larger than your machine's available RAM, a standard <code>.collect()</code> would result in an "Out of Memory" (OOM) error. By simply passing the <code>streaming=True</code> argument, you instruct Polars to process the data in batches.</p>
<p>It effectively pipelines the data processing, reading a chunk of data, processing it, and keeping only the results needed (e.g. the aggregated counts or the filtered rows) before moving on to the next chunk.</p>
<p>This allows you to process 100GB+ datasets on a standard laptop!</p>
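<p>As a minimal sketch of what that looks like (the <code>large_dataset</code> folder below is hypothetical), the only change from the earlier examples is the argument passed to <code>collect()</code>:</p>
<pre><code class="language-python">import polars as pl

# Build up a lazy query as before; nothing is read yet.
lazy = (
    pl.scan_parquet("large_dataset/*.parquet")  # hypothetical folder, potentially larger than RAM
    .filter(pl.col("year") &gt;= 1980)
    .group_by("country_code")
    .agg(pl.col("population").mean().alias("avg_population"))
)

# Execute with the streaming engine: the data is processed in batches,
# so peak memory stays bounded even if the source files exceed available RAM.
result = lazy.collect(streaming=True)
</code></pre>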
<h2 id="conclusion">Conclusion</h2>
<p>Hopefully, these worked examples have given you a flavour of Polars and provided some useful tips for applying it to your own use cases.</p>
<p>Throughout this notebook, we've demonstrated the key pillars that make Polars a game-changer for data engineering in Python, enabling you to express complex data transformations in a clear, concise, and performant way:</p>
<ul>
<li><strong>Seamless Data Ingestion:</strong> First-class support for common formats like Parquet, CSV, and JSON makes loading data trivial.</li>
<li><strong>Expressive, Composable API:</strong> The functional API design allows you to build complex logic that remains readable and maintainable.</li>
<li><strong>Performance by Design:</strong> Under the hood, the Rust-based engine leverages vectorized execution and parallelization to handle heavy data wrangling tasks effortlessly.</li>
<li><strong>Lazy Evaluation:</strong> By switching to <code>LazyFrames</code>, you hand control to the Query Optimizer. This unlocks techniques like <strong>Predicate Pushdown</strong> and <strong>Projection Pushdown</strong>, which can deliver huge performance gains just by letting Polars decide <em>how</em> to execute your query.</li>
<li><strong>Streaming Execution:</strong> The <code>streaming=True</code> option breaks the memory barrier, allowing you to process datasets larger than your machine's RAM without needing a cluster.</li>
</ul>
<p>As you become more familiar with Polars' capabilities, you'll find that it allows you to handle increasingly complex data tasks with elegance and efficiency. The expression-based API, lazy evaluation, and thoughtful design make it a powerful tool for modern data analysis.</p>
<p>This is Part 3 of our Adventures in Polars series:</p>
<ul>
<li><strong>Part 1: <a href="https://endjin.com/blog/polars-faster-pipelines-simpler-infrastructure-happier-engineers">Why Polars Matters</a></strong>  -  The Decision Makers Guide for Polars.</li>
<li><strong>Part 2: <a href="https://endjin.com/blog/under-the-hood-what-makes-polars-so-scalable-and-fast">What Makes Polars So Scalable and Fast?</a></strong>  -  The technical deep-dive: lazy evaluation, query optimisation, parallelism, and the Rust foundation.</li>
<li><strong>Part 4: <a href="https://endjin.com/blog/polars-workloads-on-microsoft-fabric">Polars Workloads on Microsoft Fabric</a></strong>  -  Running Polars on Fabric with OneLake integration.</li>
</ul>
<hr>
<p><em>What data analysis tasks have you tackled with Polars? Have you found particularly elegant solutions to common problems? Share your experiences in the comments below!</em></p>
<aside class="series-toc" aria-label="Series table of contents">
    <div class="series-toc__header">
        <h3 class="series-toc__title">Adventures in Polars</h3>
        <span class="series-toc__count">4 parts</span>
    </div>
    <ol class="series-toc__list">
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/polars-faster-pipelines-simpler-infrastructure-happier-engineers" class="series-toc__link">
                    <span class="series-toc__part-number">1.</span>
                    <span class="series-toc__part-title">Why Polars Matters</span>
                </a>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/under-the-hood-what-makes-polars-so-scalable-and-fast" class="series-toc__link">
                    <span class="series-toc__part-number">2.</span>
                    <span class="series-toc__part-title">What Makes Polars So Scalable and Fast?</span>
                </a>
            </li>
            <li class="series-toc__item series-toc__item--current" aria-current="page">
                <span class="series-toc__part-number">3.</span>
                <span class="series-toc__part-title">Code Examples</span>
                <span class="series-toc__current-label">(you are here)</span>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/polars-workloads-on-microsoft-fabric" class="series-toc__link">
                    <span class="series-toc__part-number">4.</span>
                    <span class="series-toc__part-title">Polars Workloads on Fabric</span>
                </a>
            </li>
    </ol>
</aside>

<div class="footnotes">
<hr>
<ol>
<li id="fn:1">
<p>quote from talk <a href="https://youtu.be/UwRlFtSd_-8">"What Polars does for you" presented at Europython Conference in 2023</a>.<a href="https://endjin.com/blog/practical-polars-code-examples-everyday-data-tasks#fnref:1" class="footnote-back-ref">↩</a></p>
</li>
</ol>
</div>]]></content:encoded>
    </item>
    <item>
      <title>Under the Hood: What Makes Polars So Scalable and Fast?</title>
      <description>Polars gets its speed from a strict type system, lazy evaluation, and automatic parallelism. Here's how each piece works under the hood.</description>
      <link>https://endjin.com/blog/under-the-hood-what-makes-polars-so-scalable-and-fast</link>
      <guid isPermaLink="true">https://endjin.com/blog/under-the-hood-what-makes-polars-so-scalable-and-fast</guid>
      <pubDate>Thu, 29 Jan 2026 05:32:00 GMT</pubDate>
      <category>Polars</category>
      <category>DataFrame</category>
      <category>Performance Optimization</category>
      <category>Rust</category>
      <category>Lazy Evaluation</category>
      <category>Query Optimization</category>
      <category>Parallel Processing</category>
      <category>Columnar Computing</category>
      <category>Microsoft Fabric</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/01/what-makes-polars-so-scalable-and-fast.png" />
      <dc:creator>Barry Smart</dc:creator>
      <content:encoded><![CDATA[<p>TL;DR: Polars' impressive performance (5-20x faster than Pandas) comes from multiple architectural innovations working together: a Rust foundation provides low-level performance and memory control; a columnar storage model optimizes analytical workloads; lazy evaluation enables a sophisticated query optimizer that can rearrange, combine, and streamline operations; parallel execution automatically distributes work across CPU cores; and vectorized processing maximizes modern CPU capabilities. By bringing database optimization techniques to DataFrame operations, Polars delivers exceptional performance while maintaining an elegant API.</p>
<p>In <a href="https://endjin.com/blog/polars-faster-pipelines-simpler-infrastructure-happier-engineers">our previous article</a>, we introduced Polars as a next-generation DataFrame library that's taking the Python data ecosystem by storm. We explored its origin story, key features, and how it fits into the broader data landscape.</p>
<p>Now, let's look under the hood to understand <em>how</em> Polars achieves its remarkable performance. This isn't just an academic exercise - understanding these mechanisms will help you write more efficient code, debug performance issues, and make informed decisions about when and how to use Polars in your projects.</p>
<p>Ritchie Vink, Polars' creator, often emphasizes that Polars' speed comes from multiple factors working together rather than a single performance trick. This philosophy mirrors the <a href="https://en.wikipedia.org/wiki/Dave_Brailsford">"aggregation of marginal gains" strategy championed by Sir Dave Brailsford</a>, who led the British Olympic cycling team to world dominance. Just as Brailsford believed that a 1% improvement in many small areas would accumulate into a significant competitive advantage, Polars achieves its blazing speed not through a single breakthrough, but by meticulously optimizing a multitude of small details. Let's examine each of these factors in detail.</p>
<h2 id="the-origin-story-from-performance-challenge-to-dataframe-revolution">The Origin Story: From Performance Challenge to DataFrame Revolution</h2>
<p>Every successful open-source project has an origin story, and <a href="https://pola.rs/">Polars'</a> begins with <a href="https://www.ritchievink.com/">Ritchie Vink</a>.</p>
<p>In late 2019, while learning <a href="https://rust-lang.org/">Rust</a> (then an emerging systems programming language), Vink faced a practical problem: he needed to join two CSV files efficiently. Rather than setting up a database for this seemingly simple task, he decided to implement his own join algorithm in Rust.</p>
<p>When he benchmarked his implementation against popular <a href="https://www.python.org/">Python</a> package <a href="https://pandas.pydata.org/">Pandas</a>, the results were disappointing: his code was <em>slower</em>. For most developers, this might have been the end of the experiment. For Vink, it was the beginning of a journey.</p>
<p>"This unsatisfying result planted the seed of what would later become Polars," Vink explains<a id="fnref:1" href="https://endjin.com/blog/under-the-hood-what-makes-polars-so-scalable-and-fast#fn:1" class="footnote-ref"><sup>1</sup></a>. The challenge sparked his curiosity: why was Pandas faster, and how could he improve his implementation?</p>
<p>As he dove deeper into database engines, algorithms, and performance optimization, his goals evolved. What began as a simple join algorithm grew into a DataFrame package for Rust, and eventually into a high-performance query engine designed to rival industry standards in the Python ecosystem.</p>
<p>The name "Polars" itself carries a playful significance - the polar bear representing something stronger than a panda (a nod to the incumbent Pandas library), with the "rs" suffix reflecting its Rust foundation.</p>
<p>In March 2021, Vink released <a href="https://pypi.org/project/polars/">Polars on PyPI</a>, initially as a research project. Its exceptional performance quickly gained attention, and by 2023, <a href="https://pola.rs/">Polars had grown into its own company</a>, with Vink at the helm as its creator and CEO.</p>
<h2 id="designed-for-analytical-workloads">Designed for Analytical Workloads</h2>
<p>Polars is specifically designed for analytical processing (OLAP) rather than transactional workloads (OLTP). This means it excels at operations common in data analysis:</p>
<ul>
<li>Aggregations across large datasets</li>
<li>Complex joins between tables</li>
<li>Filter operations that reduce large datasets</li>
<li>Transformations that reshape or derive new columns</li>
<li>Time series operations</li>
</ul>
<p>This analytical focus drives decisions throughout Polars' design, from its columnar storage (optimal for reading subsets of columns) to its execution model (optimized for scanning and processing large volumes of data).</p>
<h2 id="built-on-database-research">Built on Database Research</h2>
<p>Unlike many DataFrame libraries that evolved organically from array manipulation libraries (Pandas is built on NumPy), Polars applies decades of database research to DataFrame operations. This brings sophisticated query optimization techniques, columnar processing, and other database innovations directly to the Python data ecosystem.</p>
<p>Vink emphasizes this distinction: "Polars respects decades of relational database research"<a id="fnref:6" href="https://endjin.com/blog/under-the-hood-what-makes-polars-so-scalable-and-fast#fn:2" class="footnote-ref"><sup>2</sup></a>. This isn't just marketing - it's reflected in Polars' architecture, from its query optimizer to its columnar storage to its expression system.</p>
<h2 id="the-rust-foundation-performance-from-first-principles">The Rust Foundation: Performance from First Principles</h2>
<p>At the core of Polars' performance advantage is its implementation in Rust, a systems programming language that offers several key advantages for a high-performance DataFrame library:</p>
<ol>
<li><strong>Zero-cost abstractions</strong> - Rust's compiler generates machine code that's as efficient as hand-written C, without the safety risks</li>
<li><strong>Fine-grained memory control</strong> - direct control over memory allocation and layout</li>
<li><strong>No garbage collection</strong> - predictable performance without GC pauses</li>
<li><strong>Memory safety guarantees</strong> - protection against common bugs like buffer overflows and use-after-free</li>
<li><strong>Fearless concurrency</strong> - safe parallelism without data races</li>
</ol>
<p>Unlike Pandas, which is built on a mix of Python, Cython, and C through NumPy, Polars is written entirely in Rust from the ground up. This means every performance-critical component - from memory management to algorithm implementation - can be optimized with low-level control.</p>
<h3 id="python-bindings-best-of-both-worlds">Python Bindings: Best of Both Worlds</h3>
<p>While Polars' core is Rust, it exposes a carefully designed Python API through bindings. This gives users the convenience and familiarity of Python with the performance of Rust:</p>
<pre><code class="language-python"># What you write in Python
result = df.select(
    pl.col("value").sum()
)
</code></pre>
<p>What actually happens:</p>
<ol>
<li>The Python code builds an abstract query plan</li>
<li>This plan is passed to the Rust engine (via bindings, within the same OS process)</li>
<li>The Rust engine optimizes the plan</li>
<li>The Rust engine executes the plan, releasing Python's Global Interpreter Lock (GIL) so it can run multi-threaded without Python involvement</li>
<li>Results are returned to Python as Arrow-formatted memory buffers: Python receives a pointer to Rust-managed memory, not a copy of the data</li>
</ol>
<p>The last stage above is important: because Arrow is a specification, different tools that conform to it can share data without serialization:</p>
<p>As Vink explains: "if you know that a process can deal with Arrow data you can say 'I have some memory laying around here, it's laid out according to the Arrow specification'  -  at that point you can say to another process 'this is the specification and this is the pointer to where the data is'. If you read this according to this specification you can use this data as-is without needing to serialize any data"<a id="fnref:2" href="https://endjin.com/blog/under-the-hood-what-makes-polars-so-scalable-and-fast#fn:1" class="footnote-ref"><sup>1</sup></a></p>
<p>This is the zero-copy benefit. When Polars returns results to Python there's no conversion  -  just a pointer to Arrow-formatted memory.</p>
<p>This architecture minimizes the "Python tax"  -  performance-critical computation happens in Rust-managed memory and threads, while Python remains a thin orchestration layer. This is why using <code>apply</code> with Python lambdas is discouraged: it forces Polars to acquire the GIL, blocking parallel execution.</p>
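<p>To make that concrete, here is a small illustrative comparison (the data and column name are invented): the lambda-based version runs one value at a time under the GIL via <code>map_elements</code>, while the equivalent native expression stays entirely inside the Rust engine:</p>
<pre><code class="language-python">import polars as pl

df = pl.DataFrame({"value": [1.0, 2.0, 3.0, 4.0]})

# Discouraged: the Python lambda runs row by row and needs the GIL.
slow = df.with_columns(
    pl.col("value").map_elements(lambda x: x * 2 + 1, return_dtype=pl.Float64).alias("derived")
)

# Preferred: the same logic as a native expression, executed in parallel in Rust.
fast = df.with_columns(
    (pl.col("value") * 2 + 1).alias("derived")
)
</code></pre>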
<h2 id="columnar-architecture-designed-for-analytical-workloads">Columnar Architecture: Designed for Analytical Workloads</h2>
<p>Polars stores data in columnar format, conforming to the Apache Arrow specification. This architecture provides the <em>potential</em> for significant performance gains  -  but realising that potential requires an engine built to exploit it.</p>
<p><strong>What the columnar format enables:</strong></p>
<ul>
<li><strong>Better compression</strong>: Homogeneous data types stored contiguously compress more efficiently (see below)</li>
<li><strong>Selective I/O</strong>: Column-oriented storage makes it <em>possible</em> to read only needed columns</li>
<li><strong>Cache-friendly access</strong>: Contiguous memory layout <em>allows</em> efficient CPU cache utilisation</li>
<li><strong>SIMD potential</strong>: Homogeneous data <em>can</em> be processed with vectorised CPU instructions</li>
</ul>
<p><strong>What Polars adds on top:</strong></p>
<p>These benefits don't materialise automatically  -  they require an engine purpose-built to exploit them. As Vink emphasises: "We've written Polars from scratch  -  every compute is from scratch."<a id="fnref:7" href="https://endjin.com/blog/under-the-hood-what-makes-polars-so-scalable-and-fast#fn:2" class="footnote-ref"><sup>2</sup></a> The Arrow specification provides the memory layout; Polars provides the query engine that makes it fast.</p>
<p>Let's compare row-based and columnar storage visually:</p>
<p><strong>Row-based storage (like traditional databases):</strong></p>
<pre><code class="language-plaintext">[Row 1: float1, string1, date1] -&gt; [Row 2: float2, string2, date2] -&gt; [Row 3: float3, string3, date3]
</code></pre>
<p><strong>Columnar storage (like Polars):</strong></p>
<pre><code class="language-plaintext">value: [float1, float2, float3]
name:  [string1, string2, string3]
date:  [date1, date2, date3]
</code></pre>
<p>When computing something like <code>SUM(value)</code>, a columnar system only needs to load the <code>value</code> array, while a row-based system loads all data, including unused columns.</p>
<h3 id="compression-benefits-with-columnar-storage">Compression Benefits with Columnar Storage</h3>
<p>The Apache Arrow columnar storage specification offers exceptional compression opportunities, particularly for columns with low cardinality (few unique values):</p>
<ol>
<li><p><strong>Dictionary encoding</strong>: For low-cardinality columns (like categories, countries, or status codes), values can be replaced with integer indexes into a dictionary of unique values:</p>
<pre><code class="language-plaintext">Original: ["USA", "Canada", "USA", "Mexico", "Canada", "USA"]
Dictionary: ["USA"(0), "Canada"(1), "Mexico"(2)]
Encoded: [0, 1, 0, 2, 1, 0] (much smaller than storing the strings)
</code></pre>
</li>
<li><p><a href="https://en.wikipedia.org/wiki/Run-length_encoding"><strong>RLE</strong></a>: For columns with repeated consecutive values, store the value and count:</p>
<pre><code class="language-plaintext">Original: [5, 5, 5, 5, 5, 7, 7, 7, 8, 8, 8, 8]
Encoded: [(5,5), (7,3), (8,4)] (value, count)
</code></pre>
</li>
<li><p><a href="https://en.wikipedia.org/wiki/Delta_encoding"><strong>Delta encoding</strong></a>: For monotonically increasing values (like timestamps or IDs), store differences:</p>
<pre><code class="language-plaintext">Original: [1000, 1005, 1010, 1015, 1020]
Encoded: [1000, 5, 5, 5, 5] (first value, then differences)
</code></pre>
</li>
</ol>
<p>These compression techniques are particularly effective because each column contains homogeneous data types. In row-based storage, compression across different data types is much less efficient.</p>
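<p>You can see the benefit of dictionary encoding directly in Polars by casting a low-cardinality string column to <code>Categorical</code> and comparing the estimated in-memory size. This is an illustrative sketch - the exact numbers will vary by Polars version and data:</p>
<pre><code class="language-python">import polars as pl

# A low-cardinality column: many rows, only three unique values.
countries = pl.Series("country", ["USA", "Canada", "USA", "Mexico", "Canada", "USA"] * 100_000)

# Categorical stores each unique string once, plus compact integer codes per row.
encoded = countries.cast(pl.Categorical)

print(countries.estimated_size("kb"))  # full strings stored for every row
print(encoded.estimated_size("kb"))    # dictionary + integer indexes, much smaller
</code></pre>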
<h3 id="metadata-and-statistics-for-query-optimization">Metadata and Statistics for Query Optimization</h3>
<p>File formats like Parquet (frequently used with Polars) use columnar storage and additionally store column-level statistics that enable powerful optimization when reading data:</p>
<ol>
<li><p><strong>Min/max statistics</strong> - each chunk of column data stores minimum and maximum values, allowing for predicate pushdown:</p>
<pre><code class="language-python"># If a data chunk has max_value=50 and your query is:
query = pl.scan_parquet("data.parquet").filter(pl.col("value") &gt; 100)

# The entire chunk can be skipped without reading any data
</code></pre>
</li>
<li><p><strong>Null counts and positions</strong> - track where NULL values appear, allowing for more efficient processing</p>
</li>
<li><p><strong>Value distribution information</strong> - some formats store approximate histograms or count distinct estimates</p>
</li>
<li><p><strong>Row groups and column chunks</strong> - data is organized into row groups with separate column chunks, making it possible to read only relevant portions</p>
</li>
</ol>
<p>When scanning multiple Parquet files (a common Big Data scenario), these statistics become even more powerful. For example, when scanning a directory of parquet files such as the example below:</p>
<pre><code class="language-python">query = pl.scan_parquet("data/*.parquet").filter(pl.col("date") &gt; "2023-01-01")

</code></pre>
<p>Polars can use file-level statistics to skip entire files without opening them if they don't contain relevant data.</p>
<p>This metadata-driven optimization is crucial for performance when working with large datasets spread across many files. It is particularly impactful on cloud storage platforms such as <a href="https://aws.amazon.com/s3/">Amazon S3</a> or <a href="https://learn.microsoft.com/en-us/fabric/onelake/onelake-overview">Microsoft OneLake</a> which implement data lake or lakehouse architectures where data tends to be written in Parquet or Delta format. This is a key advantage of columnar formats that Polars fully leverages and we'll illustrate this in action in our <a href="https://endjin.com/blog/practical-polars-code-examples-everyday-data-tasks">next blog</a>.</p>
<h3 id="apache-arrow-memory-model">Apache Arrow Memory Model</h3>
<p>Polars implements columnar storage using the <a href="https://arrow.apache.org/">Apache Arrow</a> specification for in-memory analytical data storage, enabling seamless integration with <a href="https://arrow.apache.org/powered_by/">a growing ecosystem of popular data tools</a> which have adopted the same standard. This includes <a href="https://spark.apache.org/">Apache Spark</a>, <a href="https://pandas.pydata.org/docs/reference/arrays.html#pyarrow">pandas</a> and <a href="https://duckdb.org/docs/stable/guides/python/sql_on_arrow">DuckDB</a>.</p>
<p>Apache Arrow is particularly popular for its ability to handle large datasets efficiently and its support for zero-copy data sharing between different tools which may have been written in different languages - this eliminates serialization and deserialization overhead. This feature is crucial for applications that require low-latency data access and processing, such as machine learning pipelines, data streaming systems, high-performance computing and data engineering, where you are often integrating multiple tools to deliver the solution.</p>
<p>Polars implements its own query engine while adhering to the Arrow specification for memory layout. This foundation allows Polars to efficiently process data without the overhead of converting between different memory representations.</p>
<p>This approach delivers two key benefits:</p>
<ol>
<li><strong>Zero-copy data sharing</strong> between processes and tools that understand Arrow.</li>
<li><strong>Ecosystem compatibility</strong> with the growing universe of Arrow-enabled tools.</li>
</ol>
<p>For example, because it adopts Apache Arrow, Polars can efficiently read and write Parquet files, exchange data with DuckDB, and convert to and from Pandas DataFrames with minimal overhead. This inter-operability helps with migration as it allows incremental adoption of Polars alongside legacy technology. This is illustrated below:</p>
<p><a target="_blank" href="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/01/apache-arrow-standard-enables-zero-copy-sharing-across-tools.png"><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/01/apache-arrow-standard-enables-zero-copy-sharing-across-tools.png" alt="Apache Arrow enables zero copy sharing of analytics data between tools." title="Apache Arrow enables zero copy sharing of analytics data between tools." srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2026/01/apache-arrow-standard-enables-zero-copy-sharing-across-tools.png 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2026/01/apache-arrow-standard-enables-zero-copy-sharing-across-tools.png 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2026/01/apache-arrow-standard-enables-zero-copy-sharing-across-tools.png 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2026/01/apache-arrow-standard-enables-zero-copy-sharing-across-tools.png 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></a></p>
<h2 id="composable-expression-system">Composable Expression System</h2>
<p>Polars' expression system represents perhaps its most elegant innovation from a user perspective. Expressions are more readable than the equivalent lambda-based code, and they give the optimizer full visibility of the logic you want to apply.</p>
<p>Vink is emphatic about the importance of expressions: "we see the requirement of a Lambda... as sort of a failure of our API."<a id="fnref:3" href="https://endjin.com/blog/under-the-hood-what-makes-polars-so-scalable-and-fast#fn:1" class="footnote-ref"><sup>1</sup></a> This philosophy drives continuous improvement of the expression system to make it increasingly flexible.</p>
<p>Here's an example of such a Polars expression from <a href="https://endjin.com/blog/practical-polars-code-examples-everyday-data-tasks">Part 3</a> of this series. We'll provide more background there, but hopefully you will see similarities with the <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/index.html">PySpark SQL and DataFrame API</a> and functional programming:</p>
<pre><code class="language-python">countries = (
    countries
    .filter(~pl.col("region").is_in(["Aggregates"]))
    .select(["country_code", "country_name", "region", "capital_city", "longitude", "latitude"])
    .sort(["country_name"])
)
</code></pre>
<p>This composable expression system is a domain specific language (DSL) which provides the foundation for further optimisations we set out below.</p>
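<p>Because expressions are just Python objects until they are evaluated, they can be named, reused and combined. Here's a small hypothetical sketch (the data and column names are invented for illustration):</p>
<pre><code class="language-python">import polars as pl

df = pl.DataFrame({
    "region": ["Europe", "Asia", "Europe", "Aggregates"],
    "gdp": [1.2, 3.4, 0.9, 5.5],
    "population": [10.0, 50.0, 5.0, 65.0],
})

# Expressions are plain values: define them once, reuse them anywhere.
is_real_region = ~pl.col("region").is_in(["Aggregates"])
gdp_per_capita = (pl.col("gdp") / pl.col("population")).alias("gdp_per_capita")

result = (
    df
    .filter(is_real_region)
    .with_columns(gdp_per_capita)
    .sort("gdp_per_capita", descending=True)
)
</code></pre>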
<h2 id="lazy-evaluation-and-query-optimization">Lazy Evaluation and Query Optimization</h2>
<p>One of Polars' most distinctive features is its query optimizer, which draws directly from database technology. When using Polars' lazy execution mode, the user builds up an expression: operations aren't performed immediately but are collected into a query plan. Before execution, Polars analyzes this plan and applies optimizations that can yield orders-of-magnitude performance improvements - and users get them automatically without changing their code.</p>
<p>While Pandas executes operations immediately, Polars can defer execution to build and optimize a complete query plan.</p>
<p>This is similar to how <a href="https://learn.microsoft.com/en-us/dotnet/csharp/advanced-topics/expression-trees/">LINQ Expression Trees work in .NET</a> and in <a href="https://reaqtive.net/">Reactive Queries in Rx &amp; Reaqtor</a> where the execution of the query is deferred until a result is needed, adopting the <a href="https://en.wikipedia.org/wiki/Futures_and_promises">futures and promises</a> design pattern.</p>
<h3 id="how-lazy-evaluation-works">How Lazy Evaluation Works</h3>
<p>When using Polars' lazy API, operations don't execute immediately but instead build a logical query plan:</p>
<p>By using <code>pl.scan_csv()</code> in place of <code>pl.read_csv()</code> in the code below, the data is not loaded or processed. Instead it returns a <code>polars.LazyFrame</code> object which allows Polars to build a query plan.</p>
<pre><code class="language-python">plan = (
    pl.scan_csv("large_file.csv")
    .filter(pl.col("value") &gt; 100)
    .group_by("category")
    .agg(pl.col("value").mean().alias("avg_value"))
)
</code></pre>
<p>It means that subsequent operations we may wish to add simply get added to the query plan, for example:</p>
<pre><code class="language-python">plan = (
    plan
    .filter(pl.col("category").is_in(["Category X", "Category Y", "Category Z"]))
    .sort("avg_value", descending=True)
)
</code></pre>
<p>Execution happens only when you call collect():</p>
<pre><code class="language-python">result = plan.collect()
</code></pre>
<p>By separating the stages of building, optimising and executing the plan, Polars can analyze the entire operation chain and apply optimizations.</p>
<h3 id="the-query-optimizer">The Query Optimizer</h3>
<p>Polars' optimizer applies various transformations to the logical plan:</p>
<ol>
<li><strong>Predicate pushdown</strong> - move filters earlier to reduce data volume</li>
<li><strong>Projection pushdown</strong> - only read necessary columns from source</li>
<li><strong>Join optimization</strong> - select efficient join strategies based on data properties</li>
<li><strong>Common subexpression elimination</strong> - compute repeated expressions only once</li>
<li><strong>Function simplification</strong> - replace complex operations with simpler equivalents</li>
</ol>
<p>We show this in action in our <a href="https://endjin.com/blog/practical-polars-code-examples-everyday-data-tasks">next blog</a>.</p>
<p>In benchmark tests, these optimizations alone can provide 5-10x performance improvements over naively executed queries.</p>
<h3 id="beyond-basic-optimizations">Beyond Basic Optimizations</h3>
<p>Polars' optimizer goes beyond simple rule-based transformations to apply more sophisticated optimizations:</p>
<ol>
<li><strong>Query rewriting</strong> - replace operation sequences with more efficient alternatives</li>
<li><strong>Specialized algorithms</strong> - use purpose-built implementations for common patterns</li>
<li><strong>Meta-optimizations</strong> - decide whether certain optimizations are worthwhile based on data characteristics</li>
</ol>
<p>For example, if you write <code>df.sort().head(10)</code>, Polars might replace this with a top-k algorithm that's much more efficient than sorting the entire dataset.</p>
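<p>You can see the intent of this yourself by inspecting the plan for a sort-then-head query, or by asking for the top k directly with the <code>top_k</code> expression. A quick sketch with invented data (the exact plan text varies by Polars version):</p>
<pre><code class="language-python">import polars as pl

lf = pl.LazyFrame({"value": [5, 3, 9, 1, 7, 2, 8]})

# Written as a full sort followed by head...
plan = lf.sort("value", descending=True).head(3)
print(plan.explain())  # the optimiser may rewrite this into a top-k style operation

# ...or expressed directly as a top-k request.
top3 = lf.select(pl.col("value").top_k(3)).collect()
</code></pre>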
<h2 id="parallel-execution-using-all-your-cores">Parallel Execution: Using All Your Cores</h2>
<p>Polars automatically parallelizes operations across all available cores. This isn't something users need to configure or enable - it happens transparently. In an era of 8, 16, or even more CPU cores on standard laptops (I have 20 on my Microsoft Surface Studio laptop! 💪😎), this automatic parallelization represents a massive performance advantage without requiring any special coding patterns.</p>
<p>Unlike Pandas, which primarily operates on a single CPU core, Polars automatically parallelizes operations across all available cores by leveraging its Rust foundations. This parallelization happens in two complementary ways:</p>
<h3 id="parallel-aware-query-nodes">Parallel-Aware Query Nodes</h3>
<p>Major operations like joins, group-bys, filters, and sorts know how to divide work across threads. Each node in the query plan can implement its own parallelization strategy based on the specific operation and data characteristics.</p>
<h3 id="expression-thread-pool">Expression Thread Pool</h3>
<p>For expression evaluation, Polars uses a work-stealing thread pool:</p>
<ol>
<li>Work is divided into manageable chunks</li>
<li>Chunks are distributed across a pool of worker threads</li>
<li>When a thread finishes its work, it "steals" pending work from other threads</li>
<li>This continues until all work is complete</li>
</ol>
<p>This approach maximizes CPU utilization while avoiding the overhead of excessive thread creation and context switching.</p>
<p>The beauty of Polars' parallelism is that it's completely transparent. You don't need to explicitly parallelize your code or manage threads yourself.</p>
<p>The command <a href="https://docs.pola.rs/api/python/stable/reference/api/polars.thread_pool_size.html"><code>pl.thread_pool_size()</code></a> will return the number of threads that Polars is using - it is set automatically by the Polars engine and will generally equal the number of cores on your CPU. This can also be overridden by setting the <code>POLARS_MAX_THREADS</code> environment variable before the process starts.</p>
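<p>For example, in a script or notebook the variable needs to be set before Polars spins up its thread pool, i.e. before the first <code>import polars</code> (a minimal sketch):</p>
<pre><code class="language-python">import os

# Must be set before Polars is imported and its thread pool is created.
os.environ["POLARS_MAX_THREADS"] = "4"

import polars as pl

print(pl.thread_pool_size())  # 4 for this process; defaults to the number of CPU cores
</code></pre>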
<h2 id="vectorized-execution-batch-processing-for-performance">Vectorized Execution: Batch Processing for Performance</h2>
<p>Polars uses vectorized execution to process data efficiently by leveraging modern hardware and SIMD (see below). Rather than processing one value at a time (like traditional loops) or entire columns at once (which can exhaust memory), Polars processes data in optimally-sized batches.</p>
<h3 id="the-goldilocks-zone-vector-sizing">The Goldilocks Zone: Vector Sizing</h3>
<p>Polars processes data in vectors of 1024-2048 items. This size is carefully chosen to:</p>
<ol>
<li><strong>Fit in CPU L1 cache</strong> - typically 32-128KB per core on modern processors</li>
<li><strong>Amortize function call overhead</strong> - processing batches reduces per-item overhead</li>
<li><strong>Enable compiler optimizations</strong> - predictable sizes allow better code generation</li>
<li><strong>Balance memory pressure</strong> - not too large to cause cache misses, not too small to waste cycles</li>
</ol>
<p>This "Goldilocks" approach to batch sizing delivers significant performance benefits over both row-by-row and whole-column processing.</p>
<h3 id="simd-instructions-one-instruction-multiple-data">SIMD Instructions: One Instruction, Multiple Data</h3>
<p>Modern CPUs include special registers that can process multiple values with a single instruction  -  known as SIMD (Single Instruction, Multiple Data). The specific capabilities vary by hardware:</p>
<table>
<thead>
<tr>
<th>Instruction Set</th>
<th>Register Width</th>
<th>Values per Operation (32-bit)</th>
<th>Platform</th>
</tr>
</thead>
<tbody>
<tr>
<td>SSE2</td>
<td>128-bit</td>
<td>4</td>
<td>Intel/AMD (baseline since ~2001)</td>
</tr>
<tr>
<td>AVX2</td>
<td>256-bit</td>
<td>8</td>
<td>Intel/AMD (since ~2013)</td>
</tr>
<tr>
<td>AVX-512</td>
<td>512-bit</td>
<td>16</td>
<td>Intel Xeon, some consumer chips</td>
</tr>
<tr>
<td>NEON</td>
<td>128-bit</td>
<td>4</td>
<td>ARM (including Apple Silicon)</td>
</tr>
</tbody>
</table>
<p>Without SIMD:</p>
<pre><code>Instruction 1: a[0] + b[0] 
Instruction 2: a[1] + b[1] 
Instruction 3: a[2] + b[2] 
Instruction 4: a[3] + b[3]
</code></pre>
<p>With SIMD (one instruction does the work of four):</p>
<pre><code>Instruction 1: [a[0],a[1],a[2],a[3]] + [b[0],b[1],b[2],b[3]]
</code></pre>
<p>Historically, exploiting SIMD required writing platform-specific assembly or intrinsics. This presented a significant portability challenge. However, modern compilers (GCC, Clang, and the Rust compiler's LLVM backend) can now auto-vectorize code written in a certain style: tight loops, minimal branching, and predictable memory access patterns.</p>
<p>Polars is written to exploit SIMD instructions, through a combination of compiler auto-vectorization and, where necessary, manual implementation. The details vary by operation, but the result is transparent to the user: significant performance gains on modern hardware without platform-specific configuration.</p>
<h2 id="memory-management-efficiency-from-the-ground-up">Memory Management: Efficiency from the Ground Up</h2>
<p>Polars' architecture includes careful attention to memory management:</p>
<h3 id="efficient-data-types-and-memory-layout">Efficient Data Types and Memory Layout</h3>
<p>Polars has adopted the Apache Arrow memory specification and therefore its data types are based on that specification. The advantage of this approach is that data in memory is optimized for both memory usage and processing speed:</p>
<ol>
<li><strong>Primitive types</strong> - stored as packed arrays of values, offered at different bit sizes: 8, 16, 32 and 64.</li>
<li><strong>Decimal</strong> - 128 bit type, can exactly represent 38 significant digits.</li>
<li><strong>String data</strong> - uses a string cache for repeated values</li>
<li><strong>Categorical</strong> - optimal for encoding string based categorical columns which have low cardinality</li>
<li><strong>Enum</strong> - similar to <code>Categorical</code> but the categories are fixed and must be defined prior to data being loaded</li>
<li><strong>Temporal types</strong> - represented internally as efficient integers: <code>Int32</code> for Date (days since the Unix epoch), and <code>Int64</code> for Datetime (ns, us, or ms since the Unix epoch), Duration, and Time (ns since midnight).</li>
<li><strong>Nested</strong> - enable complex data structures to be modelled via <code>Array</code> (fixed length), <code>List</code> (any length) and <code>Struct</code> (key value pairs).</li>
<li><strong>Missing values</strong> - uses "validity bitmaps" rather than sentinel values</li>
</ol>
<p>Each of these choices reduces memory consumption compared to Pandas' approach.</p>
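<p>A small sketch showing a few of these choices in practice - explicit bit widths, <code>Categorical</code> for low-cardinality strings, a compact date type, and nulls tracked via validity bitmaps rather than sentinel values (the data is invented for illustration):</p>
<pre><code class="language-python">import polars as pl
from datetime import date

df = pl.DataFrame(
    {
        "reading": [1.5, None, 3.25],  # the null is tracked in a validity bitmap
        "status": ["ok", "ok", "error"],
        "measured_on": [date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 3)],
    },
    schema_overrides={
        "reading": pl.Float32,     # half the size of the default Float64
        "status": pl.Categorical,  # dictionary-encoded strings
    },
)

print(df.schema)
print(df["reading"].null_count())  # 1
</code></pre>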
<h3 id="zero-copy-operations-where-possible">Zero-Copy Operations Where Possible</h3>
<p>Polars uses zero-copy operations whenever feasible:</p>
<p>Selecting columns is a zero-copy operation:</p>
<pre><code class="language-python">subset = df.select("a", "b", "c")  # No data is copied
</code></pre>
<p>Filtering can also be highly efficient:</p>
<pre><code class="language-python">filtered = df.filter(pl.col("a") &gt; 0)  # Minimal memory overhead
</code></pre>
<p>This approach minimizes memory usage and improves performance by avoiding unnecessary data copying.</p>
<h3 id="spill-to-disk-for-large-workloads">Spill to Disk for Large Workloads</h3>
<p>For operations that don't fit in memory, Polars can transparently spill to disk:</p>
<ol>
<li>Process data in manageable chunks</li>
<li>Write intermediate results to disk when memory pressure is high</li>
<li>Read back as needed for final results</li>
</ol>
<p>This capability allows Polars to handle datasets larger than RAM, particularly with its streaming engine. But it comes at a performance penalty, because file I/O is generally an order of magnitude slower than RAM access.</p>
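<p>One way to take advantage of this without ever materialising the full result in memory is to stream the output straight to a file, for example with <code>sink_parquet</code> on a lazy frame (a sketch with hypothetical paths):</p>
<pre><code class="language-python">import polars as pl

# Lazily scan a folder of files that may be larger than available RAM (hypothetical path).
lf = (
    pl.scan_parquet("huge_dataset/*.parquet")
    .filter(pl.col("year") &gt;= 1980)
    .select(["country_code", "year", "population"])
)

# Stream the result to disk in chunks instead of collecting it into memory.
lf.sink_parquet("filtered_output.parquet")
</code></pre>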
<h2 id="putting-it-all-together-polars-architecture-overview">Putting it all together: Polars architecture overview</h2>
<p>At this stage we've covered the many small incremental gains that Polars achieves, but it is also worth stressing that the architecture as a whole applies robust computer science.</p>
<p>The following diagram consolidates the end-to-end architecture we've described above:</p>
<pre class="mermaid">flowchart TD
    subgraph PythonInput["Python Layer"]
        A[/"LazyFrame with Polars Expressions&lt;br/&gt;(DSL)"/]
    end

    subgraph Rust["Rust"]

      subgraph IR["Polars: IR"]
          B["Query Plan&lt;br/&gt;(AST-like structure)"]
          C{{"Schema Validation&lt;br/&gt;• Data types at each node&lt;br/&gt;• Early error detection"}}
      end

      subgraph Optimiser["Polars: Optimiser"]
          D["Optimisation Passes&lt;br/&gt;• Projection pushdown&lt;br/&gt;• Predicate pushdown&lt;br/&gt;• Join optimisation&lt;br/&gt;• Common subexpression elimination&lt;br/&gt;• Function simplification"]
          E["Optimised Plan"]
      end

      subgraph Engine["Polars: Engine"]
          F{"Engine Selection"}
          G["In-Memory Engine"]
          H["Streaming Engine"]
          I["GPU Engine&lt;br/&gt;(NVIDIA RAPIDS)"]
          J["Parallel Execution&lt;br/&gt;• Multi-threaded (GIL released)&lt;br/&gt;• SIMD instructions&lt;br/&gt;• Work-stealing thread pool"]
      end

    end

    subgraph RAM["Output: RAM"]
        K[("Arrow Memory&lt;br/&gt;(zero-copy to Python)")]
    end

    subgraph File["Output: File"]
        L[("Sink to Storage&lt;br/&gt;(e.g. Parquet)")]
    end

    subgraph PythonOutput["Python Layer"]
        M[/"Polars DataFrame or Series object"/]
        ERR["❌ Error raised&lt;br/&gt;before execution"]
    end

    A --&gt;|"Python bindings&lt;br/&gt;(same process)"| B
    B --&gt; C
    C --&gt;|"Types valid"| D
    C -.-|"Types invalid"| ERR
    D --&gt; E
    E --&gt; F
    F --&gt; G
    F --&gt; H
    F --&gt; I
    G --&gt; J
    H --&gt; J
    I --&gt; J
    J --&gt; K
    J --&gt; L
    K --&gt; M
</pre>
<p>The Polars expressions you write in Python form a Domain Specific Language (DSL) that allows you to describe operations declaratively.</p>
<p>This DSL is translated into an Intermediate Representation (IR)  -  what Vink describes as similar to "an AST in Python"<a id="fnref:4" href="https://endjin.com/blog/under-the-hood-what-makes-polars-so-scalable-and-fast#fn:1" class="footnote-ref"><sup>1</sup></a>. You can inspect this using <code>.explain()</code> or <code>.show_graph()</code> on a LazyFrame.</p>
<p>The IR captures not just the chain of operations, but also the schema at each node. As Vink explains: "Polars knows the schema on any point in the lazy frame  -  on any node you can ask what the schema is."<a id="fnref:8" href="https://endjin.com/blog/under-the-hood-what-makes-polars-so-scalable-and-fast#fn:2" class="footnote-ref"><sup>2</sup></a> This enables Polars to detect type mismatches and other errors before execution begins.</p>
<p>The plan is then passed to the optimizer, which seeks opportunities to reduce computation and data volume  -  projection pushdown, predicate pushdown, common subexpression elimination, and more.</p>
<p>Finally, the optimised plan is passed to the appropriate execution engine. As Vink notes: "you can have different engines, different backends for different data sizes because you have the distinction between the front end and the back end"<a id="fnref:5" href="https://endjin.com/blog/under-the-hood-what-makes-polars-so-scalable-and-fast#fn:1" class="footnote-ref"><sup>1</sup></a>  -  whether that's the in-memory engine, streaming engine, or GPU-accelerated execution via NVIDIA RAPIDS.</p>
<p>The engine executes and the result is either returned to Python via Arrow-formatted memory (enabling zero-copy access), or written directly to storage formats like Parquet without materialising the full result in memory.</p>
<h2 id="performance-in-practice-research-and-real-world-results">Performance in Practice: Research and Real-World Results</h2>
<p>While theoretical advantages are important, what matters is real-world performance. Independent research and testing consistently show significant performance advantages for Polars across a wide range of operations.</p>
<h3 id="academic-and-industry-research">Academic and Industry Research</h3>
<p>A <a href="https://dl.acm.org/doi/10.1145/3661167.3661203">2024 study by Felix Hänestredt et al. published in the Proceedings of the Evaluation and Assessment in Software Engineering (EASE) conference</a> compared the energy efficiency and performance of various Python data processing libraries. Their findings confirmed that Polars significantly outperforms Pandas:</p>
<ul>
<li>Polars consumed approximately <strong>8 times less energy</strong> than Pandas in synthetic data analysis tasks with large dataframes</li>
<li>For TPC-H benchmarks (an industry-standard decision support benchmark), Polars was <strong>~40% more efficient</strong> than Pandas for large dataframes</li>
</ul>
<p>More recently, <a href="https://www.linkedin.com/posts/mimounedjouallah_polars-onelake-polars-activity-7416081338258214912-pnec/">a benchmark test was published on LinkedIn</a> by <a href="https://www.linkedin.com/in/mimounedjouallah/">Mimoune Djouallah</a>. The benchmark involved processing 150 GB of CSV files and writing them to the Fabric lakehouse in Delta format, and was run on the smallest size of Fabric notebook, with only 2 cores and 16GB of RAM. The results placed Polars ahead of <a href="https://duckdb.org/docs/stable/">DuckDB</a> and <a href="https://clickhouse.com/chdb">chDB</a>.</p>
<h2 id="real-world-adoption">Real-World Adoption</h2>
<p>Beyond benchmarks, real-world adoption provides evidence of Polars' performance advantages. Organizations that have switched from Pandas to Polars frequently report:</p>
<ul>
<li>Batch processing jobs completing in minutes instead of hours</li>
<li>Ability to process larger datasets without upgrading hardware</li>
<li>Reduced cloud computing costs (and energy consumption) for data processing pipelines</li>
<li>Faster inner dev loop for developers, reducing time to value</li>
</ul>
<p>The growing adoption of Polars in production environments across various industries provides perhaps the strongest evidence of its performance benefits.</p>
<blockquote>
<p>At endjin, <strong>our default choice for dataframe driven pipelines is Polars</strong>. We only revert to Apache Spark for datasets that are genuinely "Big Data" in scale. We find that Polars is optimal for at least 90% of the workloads we encounter.</p>
</blockquote>
<h2 id="is-there-a-commercial-case">Is there a commercial case?</h2>
<p>We have not yet quantified the benefits, so we can't give you a concrete ROI.</p>
<p>However, we can say the positive impact of migrating to Polars is immediately apparent. Moving from Spark to Polars in production typically reduces runtime, and therefore the ongoing cost of running pipelines. But perhaps more apparent is that Polars unlocks the ability to develop and test locally - developers can use their laptops rather than spinning up extra capacity for development, which can incur significant costs on Big Data platforms such as Databricks, Synapse and Microsoft Fabric. Developers are also more productive - test suites run in seconds, not minutes. For example, when we migrated one workload from Spark to Polars, the test suite for a specific use case went from 60 minutes down to 9 seconds. This allowed us to extend our test suite and run tests more frequently - the result was more confidence in releases and reduced time to value.</p>
<h2 id="conclusion-performance-by-design">Conclusion: Performance by Design</h2>
<p>Polars' exceptional performance is the result of deliberate architectural choices and careful implementation, applying best practices in computer science and decades of database research. By combining a Rust foundation, columnar storage, query optimization, parallel execution, vectorized processing, and support for the latest hardware features (SIMD), Polars delivers dramatic performance improvements over traditional DataFrame libraries.</p>
<p>What makes Polars particularly remarkable is how it achieves this performance while maintaining an elegant, user-friendly API. The complexity of the underlying engine is hidden behind a clean interface that focuses on expressing <em>what</em> you want to do rather than <em>how</em> to do it efficiently.</p>
<p>For data practitioners working with datasets that fit on a single machine, Polars represents a significant advancement in processing capability. It brings many of the optimization techniques previously found only in sophisticated database systems directly to the Python ecosystem, packaged in a form that's accessible to data scientists and analysts. It's like giving a data engineer superpowers!</p>
<blockquote>
<p>Data practitioners who have traditionally worked with Apache Spark on platforms like Databricks, Azure Synapse Analytics, or Microsoft Fabric are discovering that workloads they've historically run on distributed Spark clusters can be handled efficiently by Polars on a single machine.</p>
</blockquote>
<p>The advantages of adopting Polars in place of the <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/index.html">PySpark SQL and Dataframe API</a> are compelling: simpler architecture without cluster management overhead, faster iteration cycles during development, lower infrastructure costs, and the ability to run complex data processing pipelines on commodity hardware or even locally. While Spark remains essential for truly massive datasets that require distributed processing, Polars' combination of performance and simplicity makes it an excellent choice for a substantial portion of analytical workloads.</p>
<p>This is Part 2 of our Adventures in Polars series:</p>
<ul>
<li><strong>Part 1: <a href="https://endjin.com/blog/polars-faster-pipelines-simpler-infrastructure-happier-engineers">Why Polars Matters</a></strong>  -  The Decision Makers Guide for Polars.</li>
<li><strong>Part 3: <a href="https://endjin.com/blog/practical-polars-code-examples-everyday-data-tasks">Code Examples for Everyday Data Tasks</a></strong>  -  Hands-on examples showing Polars in action.</li>
<li><strong>Part 4: <a href="https://endjin.com/blog/polars-workloads-on-microsoft-fabric">Polars Workloads on Microsoft Fabric</a></strong>  -  Running Polars on Fabric with OneLake integration.</li>
</ul>
<p>In our <a href="https://endjin.com/blog/practical-polars-code-examples-everyday-data-tasks">next article in this series</a>, we'll show Polars in action.</p>
<hr>
<p><em>Have you experienced performance improvements with Polars in your projects? What operations have you found particularly faster? Share your experiences in the comments below!</em></p>
<aside class="series-toc" aria-label="Series table of contents">
    <div class="series-toc__header">
        <h3 class="series-toc__title">Adventures in Polars</h3>
        <span class="series-toc__count">4 parts</span>
    </div>
    <ol class="series-toc__list">
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/polars-faster-pipelines-simpler-infrastructure-happier-engineers" class="series-toc__link">
                    <span class="series-toc__part-number">1.</span>
                    <span class="series-toc__part-title">Why Polars Matters</span>
                </a>
            </li>
            <li class="series-toc__item series-toc__item--current" aria-current="page">
                <span class="series-toc__part-number">2.</span>
                <span class="series-toc__part-title">What Makes Polars So Scalable and Fast?</span>
                <span class="series-toc__current-label">(you are here)</span>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/practical-polars-code-examples-everyday-data-tasks" class="series-toc__link">
                    <span class="series-toc__part-number">3.</span>
                    <span class="series-toc__part-title">Code Examples</span>
                </a>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/polars-workloads-on-microsoft-fabric" class="series-toc__link">
                    <span class="series-toc__part-number">4.</span>
                    <span class="series-toc__part-title">Polars Workloads on Fabric</span>
                </a>
            </li>
    </ol>
</aside>

<div class="footnotes">
<hr>
<ol>
<li id="fn:1">
<p>Quote from the podcast episode <a href="https://youtu.be/ubqF0yGyphU">"827: Polars: Past, Present and Future  -  with Polars Creator Ritchie Vink" published on the Super Data Science: ML &amp; AI Podcast with Jon Krohn.</a><a href="https://endjin.com/blog/under-the-hood-what-makes-polars-so-scalable-and-fast#fnref:1" class="footnote-back-ref">↩</a><a href="https://endjin.com/blog/under-the-hood-what-makes-polars-so-scalable-and-fast#fnref:2" class="footnote-back-ref">↩</a><a href="https://endjin.com/blog/under-the-hood-what-makes-polars-so-scalable-and-fast#fnref:3" class="footnote-back-ref">↩</a><a href="https://endjin.com/blog/under-the-hood-what-makes-polars-so-scalable-and-fast#fnref:4" class="footnote-back-ref">↩</a><a href="https://endjin.com/blog/under-the-hood-what-makes-polars-so-scalable-and-fast#fnref:5" class="footnote-back-ref">↩</a></p>
</li>
<li id="fn:2">
<p>Quote from the talk <a href="https://youtu.be/UwRlFtSd_-8">"What Polars does for you", presented at the EuroPython Conference in 2023</a>.<a href="https://endjin.com/blog/under-the-hood-what-makes-polars-so-scalable-and-fast#fnref:6" class="footnote-back-ref">↩</a><a href="https://endjin.com/blog/under-the-hood-what-makes-polars-so-scalable-and-fast#fnref:7" class="footnote-back-ref">↩</a><a href="https://endjin.com/blog/under-the-hood-what-makes-polars-so-scalable-and-fast#fnref:8" class="footnote-back-ref">↩</a></p>
</li>
</ol>
</div>]]></content:encoded>
    </item>
    <item>
      <title>Polars: Faster Pipelines, Simpler Infrastructure, Happier Engineers</title>
      <description>We've migrated our own IP and several customers from Pandas and Spark to Polars. The benefits go beyond raw speed: faster test suites, lower platform costs, and an API developers actually enjoy using.</description>
      <link>https://endjin.com/blog/polars-faster-pipelines-simpler-infrastructure-happier-engineers</link>
      <guid isPermaLink="true">https://endjin.com/blog/polars-faster-pipelines-simpler-infrastructure-happier-engineers</guid>
      <pubDate>Thu, 29 Jan 2026 05:31:00 GMT</pubDate>
      <category>Polars</category>
      <category>DataFrame</category>
      <category>Python</category>
      <category>Data</category>
      <category>Analytics</category>
      <category>Performance</category>
      <category>Rust</category>
      <category>Data Processing</category>
      <category>Microsoft Fabric</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/01/why-polars-matters.png" />
      <dc:creator>Barry Smart</dc:creator>
      <content:encoded><![CDATA[<p>TLDR; Polars is a DataFrame library written in Rust with Python bindings. We've migrated our own IP and several of our customers from both Pandas and Spark to Polars, and the benefits extend beyond raw speed: faster test suites, lower platform costs, and an API that developers actually enjoy using. It's open source, has zero dependencies, and can be deployed on a broad range of infrastructure options. If you're still defaulting to Pandas or reaching for Spark when datasets grow, it's worth understanding what this new generation of tooling can offer.</p>
<h2 id="why-were-writing-this">Why We're Writing This</h2>
<p>Over the past eighteen months, we've been migrating our core data engineering IP - and helping a number of our customers do the same - from <a href="https://spark.apache.org/docs/latest/api/python/index.html">PySpark</a> and <a href="https://pandas.pydata.org/">Pandas</a> based solutions to <a href="https://pola.rs/">Polars</a>. The results have been compelling enough that we felt it was time to share what we've learned.</p>
<blockquote>
<p><strong>At endjin, Polars is our default choice for DataFrame-driven pipelines.</strong> We reach for <a href="https://spark.apache.org/">Apache Spark</a> on the few occasions when data volumes genuinely require distributed compute.</p>
</blockquote>
<p>This is a practitioner's perspective on a tool we've bet on, deployed to production, and would choose again.</p>
<h2 id="what-weve-seen-in-practice">What We've Seen in Practice</h2>
<p>The headline benefits are significant. Here's what changed when we migrated from PySpark to Polars:</p>
<table>
<thead>
<tr>
<th>Metric</th>
<th>Before</th>
<th>After</th>
</tr>
</thead>
<tbody>
<tr>
<td>Test suite execution</td>
<td>~60 minutes (Spark spin-up)</td>
<td>~30 seconds</td>
</tr>
<tr>
<td>Developer iteration cycle</td>
<td>Deploy to cluster, wait, check logs</td>
<td>Run locally, iterate, commit</td>
</tr>
<tr>
<td>Infrastructure model</td>
<td>Distributed compute on PaaS (e.g. Databricks)</td>
<td>Single-node commodity hardware</td>
</tr>
<tr>
<td>Monthly compute costs</td>
<td></td>
<td>&gt;50% reduction (one customer)</td>
</tr>
</tbody>
</table>
<p>Beyond the numbers, there's a qualitative shift that's harder to measure but equally important: developers spend less time fighting their tools and more time solving business problems.</p>
<p>The intuitive API and strict type system mean fewer runtime surprises. The lightweight footprint means local development works, giving developers access to modern <a href="https://en.wikipedia.org/wiki/Integrated_development_environment">IDE</a> tooling and coding agents: a complete contrast to remote Spark environments and the frustrations of working in a browser-based UI. And because Polars is open source with zero dependencies, there's no vendor lock-in, no licensing complexity, and no heavyweight runtime to manage.</p>
<p>Independent research has measured the environmental impact too: Polars uses approximately <a href="https://dl.acm.org/doi/10.1145/3661167.3661203">8x less energy than equivalent Pandas operations</a> - something that matters increasingly to our clients with sustainability commitments.</p>
<p>It's time for organisations to seriously evaluate this new generation of data tooling. The technology has matured, the ecosystem is growing, and an increasing number of organisations are seeing order-of-magnitude gains from making the transition.</p>
<p>This blog aims to give you the information you need to determine whether deeper evaluation is merited for your organisation.</p>
<h2 id="what-is-polars">What is Polars?</h2>
<p>As Polars creator <a href="https://www.ritchievink.com/">Ritchie Vink</a> puts it:</p>
<p>"Polars is a query engine with a DataFrame front end... it respects decades of relational database research."<a id="fnref:1" href="https://endjin.com/blog/polars-faster-pipelines-simpler-infrastructure-happier-engineers#fn:1" class="footnote-ref"><sup>1</sup></a></p>
<p>This is a subtle but important distinction from describing it merely as a "DataFrame library", highlighting database-inspired optimization capabilities.  At its core, Polars is designed to provide:</p>
<ol>
<li><strong>Lightning-fast performance</strong> - 5-20x faster than Pandas for most operations, with some users reporting up to 100x speedups in specific scenarios</li>
<li><strong>Memory efficiency</strong> - dramatically reduced memory usage compared to Pandas</li>
<li><strong>An expressive, consistent API</strong> - a thoughtfully designed interface that balances power with readability</li>
<li><strong>Scalability on a single machine</strong> - making the most of modern hardware through parallelization and efficient algorithms</li>
</ol>
<h2 id="where-polars-fits">Where Polars Fits</h2>
<p>To understand where Polars adds value, it helps to see the landscape it sits within.</p>
<h3 id="the-two-migration-paths">The Two Migration Paths</h3>
<p>We see organisations coming to Polars from two directions:</p>
<p><strong>From Pandas:</strong> Teams hitting scaling limits. Datasets that used to fit comfortably in memory now cause out-of-memory errors. Operations that used to be fast enough now take minutes. The reflexive answer is "we need Spark" - but that's often overkill.</p>
<p><strong>From Spark:</strong> Teams realising they're over-engineered. They're paying for distributed compute to process datasets that would fit on a laptop or a single small commodity compute node. They're waiting for clusters to spin up to run tests. The infrastructure complexity is slowing them down rather than enabling them.</p>
<p>Polars sits in the middle: powerful enough to handle datasets that break Pandas, simple enough that you don't need a platform team to run it.</p>
<p><a target="_blank" href="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/01/motivations-to-move-from-pandas-or-spark-to-polars.PNG"><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/01/motivations-to-move-from-pandas-or-spark-to-polars.PNG" alt="Motivations for moving from Pandas or Spark to Polars shows two pathways to Polars one from Pandas and the other from Spark" title="Motivations for moving from Pandas or Spark to Polars shows two pathways to Polars one from Pandas and the other from Spark" srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2026/01/motivations-to-move-from-pandas-or-spark-to-polars.PNG 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2026/01/motivations-to-move-from-pandas-or-spark-to-polars.PNG 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2026/01/motivations-to-move-from-pandas-or-spark-to-polars.PNG 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2026/01/motivations-to-move-from-pandas-or-spark-to-polars.PNG 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></a></p>
<h3 id="the-data-singularity">The Data Singularity</h3>
<p>There's a broader trend at play here. Hannes Mühleisen, co-creator of <a href="https://duckdb.org/">DuckDB</a>, co-founder and CEO of <a href="https://duckdblabs.com/">DuckDB Labs</a> and <a href="https://hannes.muehleisen.org/">Professor of Data Engineering at the University of Nijmegen</a> describes what he calls the <a href="https://endjin.com/blog/duckdb-rise-of-in-process-analytics-understanding-data-singularity">"data singularity"</a>:</p>
<blockquote>
<p>We are approaching a point where the processing power of mainstream single-node machines, [including laptops], will surpass the requirements of the vast majority of analytical workloads.</p>
</blockquote>
<p>CPU core counts have increased dramatically. RAM is plentiful. NVMe storage offers throughput that would have seemed impossible a decade ago. But most data tools were designed before this shift, and they don't take advantage of it.</p>
<p>However, Polars does. It's designed from the ground up to exploit modern hardware: automatic parallelisation across all cores, efficient memory use, and algorithms optimised for contemporary CPU architectures.</p>
<p><a target="_blank" href="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/03/duckdb-data-singularity.png"><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/03/duckdb-data-singularity.png" alt="Illustration of Data Singularity: showing the point where the compute power of commodity hardware increases to the point where it is capable of processing 99% of useful datasets" title="Illustration of Data Singularity: showing the point where the compute power of commodity hardware increases to the point where it is capable of processing 99% of useful datasets" srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2025/03/duckdb-data-singularity.png 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2025/03/duckdb-data-singularity.png 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2025/03/duckdb-data-singularity.png 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2025/03/duckdb-data-singularity.png 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></a></p>
<p>This means datasets that "required" Spark five years ago can now run on a single node <strong>IF</strong> you adopt the right tooling - i.e. in-process engines such as Polars or DuckDB.</p>
<h2 id="what-makes-this-possible">What Makes This Possible</h2>
<p>You don't need to understand the internals to benefit from Polars, but it helps to know why it's fast. Here's the short version:</p>
<ul>
<li><p><strong>Built for analytics</strong> - Polars is designed specifically for analytical workloads (aggregations, joins, transformations) rather than trying to be general-purpose. This focus drives design decisions throughout.</p>
</li>
<li><p><strong>Automatic optimisation</strong> - when you write a Polars query, you're describing <em>what</em> you want, not <em>how</em> to compute it. Polars analyses your query and figures out the most efficient execution plan - reordering operations, eliminating redundant work, pushing filters down. You get database-grade query optimisation without writing SQL.</p>
</li>
<li><p><strong>Parallelism by default</strong> - Polars automatically uses all available CPU cores. No configuration, no special coding patterns. On a 16-core laptop, that's a potential 16x speedup over single-threaded tools - and you get it for free.</p>
</li>
<li><p><strong>Rust foundation</strong> - written in Rust (a systems programming language with C++-level performance), Polars has full control over memory layout and execution. There's no Python interpreter overhead in the hot path.</p>
</li>
</ul>
<p>The practical upshot: you write readable, declarative code, and Polars makes it fast. As Polars creator Ritchie Vink puts it: "Write readable idiomatic queries which explain your intent, and we will figure out how to make it fast."</p>
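<p>As a hedged illustration of that declarative style (the file and column names are ours, not from a real workload), the following lazy query describes the result we want and lets Polars work out the plan:</p>
<pre><code class="language-python"># Illustrative sketch of a lazy Polars query; "sales.parquet" and the column
# names are placeholders, not a real dataset.
import polars as pl

query = (
    pl.scan_parquet("sales.parquet")         # lazy scan: nothing is read yet
    .filter(pl.col("region") == "EMEA")      # predicate is pushed down to the scan
    .group_by("product")
    .agg(pl.col("revenue").sum().alias("total_revenue"))
    .sort("total_revenue", descending=True)
)

print(query.explain())    # inspect the optimised plan before running it
result = query.collect()  # execution is parallelised across all available cores
</code></pre>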
<p>We explore the technical details in <a href="https://endjin.com/blog/under-the-hood-what-makes-polars-so-scalable-and-fast">Part 2: What Makes Polars So Scalable and Fast?</a> for those who want to go deeper.</p>
<h2 id="common-questions-we-get">Common Questions We Get</h2>
<p>When we recommend Polars to clients, certain questions come up repeatedly. Here's how we answer them:</p>
<h3 id="isnt-pandas-2.0-with-pyarrow-good-enough-now">"Isn't Pandas 2.0 with PyArrow good enough now?"</h3>
<p>Pandas has adopted Arrow for storage, which is a step forward. But the execution model is unchanged - Pandas still processes data single-threaded without query optimisation.</p>
<p>As Ritchie Vink has noted: "Pandas is using PyArrow kernels for compute... those are totally different implementations."<a id="fnref:2" href="https://endjin.com/blog/polars-faster-pipelines-simpler-infrastructure-happier-engineers#fn:2" class="footnote-ref"><sup>2</sup></a> Adopting Arrow for storage doesn't give you Polars' query optimizer or automatic parallelisation.</p>
<p>Our benchmarks show Polars still outperforming Pandas 2.x by 5-10x on typical analytical workloads - and often more on complex queries where optimisation matters most.</p>
<p>Furthermore, because Pandas operates only in "eager" execution mode, we find we tend to hit out-of-memory limitations. Switching to Polars and leveraging its "lazy" execution model will often overcome that limitation, allowing you to get more mileage out of existing infrastructure.</p>
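<p>As a sketch of the pattern we mean (file and column names are illustrative, and a recent Polars release is assumed), the lazy API lets you filter, aggregate and write results without ever materialising the full dataset in memory:</p>
<pre><code class="language-python"># Illustrative sketch: lazy scan plus a streaming write, so the full dataset
# never has to fit in memory. File and column names are placeholders.
import polars as pl

lf = (
    pl.scan_csv("events_*.csv")                  # lazy scan over many files
    .filter(pl.col("event_type") == "purchase")  # pushdown limits what is read
    .group_by("customer_id")
    .agg(pl.len().alias("purchases"))
)

# Stream the aggregated result straight to Parquet.
lf.sink_parquet("purchases_by_customer.parquet")
</code></pre>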
<h3 id="why-not-just-optimise-our-spark-jobs">"Why not just optimise our Spark jobs?"</h3>
<p>You can, and sometimes you should. But ask yourself: do you actually <em>need</em> distributed compute?</p>
<p>We've seen teams running Spark clusters to process datasets that fit comfortably in memory on a single node. The overhead of cluster management, network serialisation, and distributed coordination often exceeds the benefit.</p>
<p>If your data fits on one machine, Polars will almost certainly be faster <em>and</em> simpler. If it doesn't, Spark (or Polars Cloud) makes sense. The key is being honest about which category you're in - and many teams overestimate.</p>
<h3 id="whats-the-migration-risk">"What's the migration risk?"</h3>
<p>Lower than you might expect, with caveats.</p>
<p>The main prerequisite is good test coverage. If you have comprehensive tests for your data pipelines, migration becomes a matter of swapping out the implementation and verifying the outputs match.</p>
<p>When migrating from Pandas, we have found that Polars' stricter type system often <em>catches</em> bugs that Pandas was silently propagating.  The APIs share conceptual similarities, especially if you're already using method chaining in Pandas. The mental model shift is less about syntax and more about embracing lazy evaluation and expressions over imperative loops and <code>.apply()</code>.</p>
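<p>A small, hedged example of that shift (column names are illustrative): the row-wise <code>.apply()</code> you might write in Pandas becomes a vectorised expression in Polars that the engine can optimise and parallelise:</p>
<pre><code class="language-python"># Illustrative comparison only; column names are placeholders.
import pandas as pd
import polars as pl

# Typical Pandas style: a Python-level function applied row by row.
pdf = pd.DataFrame({"net": [100.0, 250.0], "rate": [0.2, 0.05]})
pdf["gross"] = pdf.apply(lambda row: row["net"] * (1 + row["rate"]), axis=1)

# Equivalent Polars style: a declarative expression evaluated by the Rust engine.
pldf = pl.DataFrame({"net": [100.0, 250.0], "rate": [0.2, 0.05]})
pldf = pldf.with_columns((pl.col("net") * (1 + pl.col("rate"))).alias("gross"))
</code></pre>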
<p>Migration from the PySpark DataFrame API is more straightforward: the APIs are more similar, and PySpark also uses lazy evaluation.</p>
<p>We've migrated multiple production systems successfully. The pattern that works: migrate incrementally, run both implementations in parallel initially, validate outputs match, then cut over.</p>
<h3 id="is-this-mature-enough-for-production">"Is this mature enough for production?"</h3>
<p>Yes. Polars has been in production use since 2021. There's now a company (<a href="https://pola.rs/">Polars Inc.</a>) providing commercial support and building enterprise features. The ecosystem has reached critical mass.</p>
<p>Perhaps more tellingly, Microsoft has included Polars in the default build for <a href="https://learn.microsoft.com/en-us/fabric/data-engineering/using-python-experience-on-notebook">Fabric Python Notebooks</a> - they're not betting on immature technology.</p>
<h3 id="what-if-we-need-to-scale-beyond-a-single-machine">"What if we need to scale beyond a single machine?"</h3>
<p>Polars has a streaming engine for larger-than-RAM datasets on a single node. For true horizontal scaling, Polars Inc. offers <a href="https://cloud.pola.rs/">Polars Cloud</a>, which distributes queries across multiple machines while maintaining the same API.</p>
<p>And because Polars is built on Apache Arrow, it interoperates cleanly with Spark if you need to hand off to distributed compute for specific workloads.</p>
<h3 id="can-we-use-ai-agents-to-do-the-migration">"Can we use AI agents to do the migration?"</h3>
<p>Yes, with appropriate guardrails.  AI agents such as <a href="https://github.com/features/copilot">GitHub Copilot</a> or <a href="https://claude.com/product/claude-code">Claude Code</a> could be used for the heavy lifting.</p>
<p>An AI-assisted migration will be more effective if your legacy code base is well structured and documented. You will also likely need to document or provide examples of patterns that you want the AI agent to apply in the migrated code base. A suite of tests and "human in the loop" code reviews are essential.</p>
<p>We recommend doing the initial proof of value manually to understand the nuance, and then capturing learnings from that to be carried forward by the AI agent. For example, an AI is likely to carry forward the use of imperative loops and <code>.apply()</code> from Pandas into a Polars version of the same code, which is sub-optimal. By adding coding standards and examples into the context for the AI agent, you will encourage it to use Polars' rich expression language and therefore take full advantage of its Rust-based engine.</p>
<h2 id="our-stack">Our Stack</h2>
<p>Polars doesn't exist in isolation. A key part of our evaluation was whether it integrates with the tools our clients already use.</p>
<p>Here's what we've found works well in practice:</p>
<h3 id="visualization">Visualization</h3>
<ul>
<li><strong><a href="https://plotly.com/python/">Plotly</a></strong> - Plotly works cleanly with Polars. While Python charting libraries often expect Pandas DataFrames, Polars' zero-copy conversion to arrow makes passing data to plotting libraries efficient.</li>
<li><strong><a href="https://altair-viz.github.io/">Altair</a></strong> - a declarative statistical visualization library that has excellent native support for Polars.</li>
</ul>
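<p>For example, Altair will accept a Polars DataFrame directly. Here's a minimal sketch, assuming a recent Altair 5.x release, with illustrative data:</p>
<pre><code class="language-python"># Minimal sketch of charting a Polars DataFrame with Altair; data is illustrative.
import altair as alt
import polars as pl

df = pl.DataFrame({"product": ["a", "b", "c"], "revenue": [10.0, 20.0, 5.0]})

chart = alt.Chart(df).mark_bar().encode(x="product", y="revenue")
chart.save("revenue_by_product.html")  # self-contained HTML output
</code></pre>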
<h3 id="data-validation">Data Validation</h3>
<ul>
<li><strong><a href="https://pandera.readthedocs.io/en/stable/">Pandera</a></strong> - a statistical data validation toolkit. Pandera allows you to define DataFrame schemas (including checks for data types and value ranges) and validate your Polars DataFrames at runtime to ensure data quality.</li>
</ul>
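<p>A hedged sketch of what that looks like, assuming a recent Pandera release with Polars support (the schema and columns are illustrative):</p>
<pre><code class="language-python"># Illustrative schema validation with Pandera's Polars integration.
import pandera.polars as pa
import polars as pl


class SalesSchema(pa.DataFrameModel):
    product: str
    revenue: float = pa.Field(ge=0)  # revenue must be non-negative


df = pl.DataFrame({"product": ["a", "b"], "revenue": [10.0, 20.0]})
validated = SalesSchema.validate(df)  # raises a SchemaError if a check fails
</code></pre>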
<h3 id="interoperability">Interoperability</h3>
<ul>
<li><strong><a href="https://arrow.apache.org/">Apache Arrow</a></strong> -a Polars is built on top of the Arrow specification. This allows for <strong>Zero-Copy</strong> data exchange with other Arrow-based tools. You can pass a Polars DataFrame to <code>pyarrow</code> or other Arrow-consumers without duplicating the data in memory.</li>
<li><strong><a href="https://duckdb.org/">DuckDB</a> + Polars</strong> - we often use these together. DuckDB can query Polars DataFrames directly via SQL without copying data. This lets us mix SQL (for complex window functions or ad-hoc exploration) with Polars expressions (for transformation pipelines) in the same workflow. See our <a href="https://endjin.com/blog/duckdb-rise-of-in-process-analytics-understanding-data-singularity">DuckDB blog series</a> for more details.</li>
<li><strong><a href="https://narwhals-dev.github.io/narwhals/">Narwhals</a></strong> - a DataFrame-agnostic API that lets library maintainers write code once and support Polars, Pandas, and other DataFrame libraries automatically. Pass in a Polars DataFrame, get a Polars DataFrame back. This is how libraries like Altair added Polars support without taking on Polars as a dependency—keeping them lightweight and broadly compatible.</li>
</ul>
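<p>Here is the DuckDB pattern mentioned above as a minimal sketch. The data and SQL are illustrative, but the key point holds: DuckDB queries the Polars DataFrame in place via Arrow, without copying it:</p>
<pre><code class="language-python"># Illustrative sketch of querying a Polars DataFrame with DuckDB SQL.
import duckdb
import polars as pl

df = pl.DataFrame({"product": ["a", "b", "a"], "revenue": [10.0, 20.0, 5.0]})

# DuckDB resolves "df" from the local scope and reads it zero-copy via Arrow.
summary = duckdb.sql(
    "SELECT product, SUM(revenue) AS total_revenue FROM df GROUP BY product"
).pl()  # return the result as a Polars DataFrame
</code></pre>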
<h3 id="machine-learning">Machine Learning</h3>
<ul>
<li><strong><a href="https://scikit-learn.org/">scikit-learn</a></strong> - Polars integrates well with the standard ML stack. You can pass Polars DataFrames directly to many scikit-learn models, or efficiently convert them to the required format for training.</li>
</ul>
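<p>A small, hedged sketch of that workflow with illustrative data. Here we convert explicitly with <code>to_numpy()</code> to stay safe across library versions, although recent scikit-learn releases accept Polars input directly for many estimators:</p>
<pre><code class="language-python"># Illustrative sketch: training a scikit-learn model from a Polars DataFrame.
import polars as pl
from sklearn.linear_model import LinearRegression

df = pl.DataFrame({
    "bedrooms": [2, 3, 4, 3],
    "area_sqm": [60.0, 85.0, 120.0, 90.0],
    "price": [200_000.0, 280_000.0, 410_000.0, 300_000.0],
})

# Explicit conversion keeps the example dependency-safe across library versions.
features = df.select("bedrooms", "area_sqm").to_numpy()
target = df.get_column("price").to_numpy()

model = LinearRegression().fit(features, target)
</code></pre>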
<h3 id="web-applications">Web Applications</h3>
<ul>
<li><strong><a href="https://streamlit.io/">Streamlit</a></strong> - popular framework for building interactive data apps. Streamlit has added native support for Polars, meaning you can pass <code>pl.DataFrame</code> objects directly to functions like <code>st.dataframe()</code> and <code>st.line_chart()</code> without manual conversion.  We love Streamlit and have a series of videos on this topic such as: <a href="https://youtu.be/1nZhPP8G1cY">Getting Started with Python &amp; Streamlit</a></li>
</ul>
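<p>A minimal sketch of a Streamlit page backed by Polars, assuming a recent Streamlit release with native Polars support (the file and column names are illustrative):</p>
<pre><code class="language-python"># Illustrative Streamlit app reading and aggregating data with Polars.
# Run with: streamlit run app.py
import polars as pl
import streamlit as st

st.title("Revenue by product")

df = pl.read_parquet("sales.parquet")
summary = df.group_by("product").agg(pl.col("revenue").sum().alias("total_revenue"))

st.dataframe(summary)  # Streamlit renders the Polars DataFrame without conversion
st.bar_chart(summary, x="product", y="total_revenue")
</code></pre>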
<h3 id="the-pandas-escape-hatch">The "Pandas Escape Hatch"</h3>
<ul>
<li><strong><a href="https://pandas.pydata.org/">Pandas</a></strong> - the reality is that the Python ecosystem is vast, and you may find libraries that strictly require Pandas input. You can bridge this gap using <code>.to_pandas()</code>.</li>
</ul>
<div class="aside"><p><strong>Health Warning ⚠️</strong>: this converts your data into the Pandas format. This is an expensive operation that copies data in memory and forces eager execution. Doing this routinely undermines the performance and memory-efficiency benefits you chose Polars for in the first place! Use it only when strictly necessary.</p>
</div>
<h3 id="cloud-platform">Cloud platform</h3>
<ul>
<li><strong><a href="https://cloud.pola.rs/">Polars Cloud</a></strong> - the commercial offering from Polars Inc. extends the open-source engine with serverless execution, horizontal scaling for partitioned data, and fault tolerance. Polars Cloud lets you take local Polars queries and run them at scale without managing infrastructure—using the same open-source engine under the hood.</li>
<li><strong><a href="https://learn.microsoft.com/en-us/fabric/">Microsoft Fabric</a></strong> - Polars is part of the default build for <a href="https://learn.microsoft.com/en-us/fabric/data-engineering/using-python-experience-on-notebook">Fabric <strong>Python</strong> Notebooks</a>, providing an out-of-the-box experience for running fast analytics on the platform.  With the flexibility to revert to Spark (which is available on <a href="https://learn.microsoft.com/en-us/fabric/data-engineering/author-execute-notebook">Fabric <strong>PySpark</strong> Notebook</a>) for the larger workloads.</li>
<li><strong><a href="https://learn.microsoft.com/en-us/azure/container-apps/overview">Azure Container Apps</a></strong> - enable containerised deployment of Polars workloads onto Azure which can be triggered by Data Factory, Synapse or Fabric pipelines connecting to ADLS2 or OneLake storage.</li>
</ul>
<h3 id="development-environment">Development environment</h3>
<ul>
<li><strong><a href="https://code.visualstudio.com/">Visual Studio Code</a></strong> - we love VS Code!  Polars is very suited to local development and works well with features such as <a href="https://code.visualstudio.com/docs/devcontainers/containers">dev containers</a> and <a href="https://code.visualstudio.com/docs/datascience/data-wrangler">data wrangler</a> to create a developer experience that feels like mainstream software engineering.</li>
</ul>
<p><a target="_blank" href="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/01/the-extensive-python-polars-ecosystem.PNG"><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2026/01/the-extensive-python-polars-ecosystem.PNG" alt="The extensive Polars ecosystem illustrated using logos across different categories including visualization, data validation, interoperability, machine learning, web applications and cloud platform" title="The extensive Polars ecosystem illustrated using logos across different categories including visualization, data validation, interoperability, machine learning, web applications and cloud platform" srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2026/01/the-extensive-python-polars-ecosystem.PNG 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2026/01/the-extensive-python-polars-ecosystem.PNG 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2026/01/the-extensive-python-polars-ecosystem.PNG 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2026/01/the-extensive-python-polars-ecosystem.PNG 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></a></p>
<h2 id="getting-started-with-polars">Getting Started with Polars</h2>
<p>Ready to try Polars? The <a href="https://docs.pola.rs/">official documentation</a> provides comprehensive guides, examples and API definition.</p>
<h2 id="the-road-ahead">The Road Ahead</h2>
<p>Polars is actively developed, with a clear trajectory:</p>
<p><strong>Streaming Engine:</strong> Now available, this handles datasets larger than RAM by processing in batches and spilling to disk when needed - extending Polars' reach without requiring distributed compute.</p>
<p><strong>Polars Cloud:</strong> The commercial offering from Polars Inc. adds horizontal scaling, fault tolerance, and serverless execution while maintaining the same API. Write queries locally, run them at scale.</p>
<p><strong>GPU Support:</strong> Integration with Nvidia RAPIDS for GPU-accelerated processing on supported operations.</p>
<p><strong>Growing Ecosystem:</strong> The plugin system allows custom extensions, and the community continues to build integrations with specialised tools.</p>
<p>We get a sense of an organically growing ecosystem with solid foundations - one that can scale across multiple dimensions (scale up, scale out, batch and streaming) as needs evolve.</p>
<h2 id="our-position">Our Position</h2>
<p>After eighteen months and multiple production migrations, Polars is our default choice for DataFrame-driven pipelines. We revert to Spark only when data volumes genuinely require distributed compute, which, in our experience, is less than 5% of use cases.</p>
<p>The technology has matured. The ecosystem is production-ready. The benefits compound over time as your team builds fluency with the expression API and lazy evaluation model.</p>
<p>Should you evaluate Polars? Consider your situation:</p>
<p><strong>Strong fit:</strong></p>
<ul>
<li>You're hitting performance or scaling limits with Pandas</li>
<li>You're using Spark but suspect it's overkill for your data volumes</li>
<li>You value developer experience and fast iteration cycles</li>
<li>You want to reduce infrastructure costs and complexity</li>
<li>You're building new pipelines and want to start with modern tooling</li>
</ul>
<p><strong>Weaker fit:</strong></p>
<ul>
<li>Your existing pipelines are genuinely fast enough</li>
<li>You lack test coverage to validate migration correctness</li>
<li>You have deep investment in Spark-specific features (MLlib, GraphX, etc.)</li>
<li>Your data volumes genuinely require distributed compute</li>
</ul>
<p>For most data teams, we believe Polars represents a significant opportunity to simplify infrastructure, accelerate development, and reduce costs - without sacrificing capability.</p>
<p>If you're still defaulting to Pandas out of habit, or spinning up Spark clusters for datasets that fit in memory, it's time to reconsider.</p>
<hr>
<h2 id="whats-next">What's Next</h2>
<p>This is Part 1 of our Adventures in Polars series:</p>
<ul>
<li><strong>Part 2: <a href="https://endjin.com/blog/under-the-hood-what-makes-polars-so-scalable-and-fast">What Makes Polars So Scalable and Fast?</a></strong> - The technical deep-dive: lazy evaluation, query optimisation, parallelism, and the Rust foundation.</li>
<li><strong>Part 3: <a href="https://endjin.com/blog/practical-polars-code-examples-everyday-data-tasks">Code Examples for Everyday Data Tasks</a></strong> - Hands-on examples showing Polars in action.</li>
<li><strong>Part 4: <a href="https://endjin.com/blog/polars-workloads-on-microsoft-fabric">Polars Workloads on Microsoft Fabric</a></strong> - Running Polars on Fabric with OneLake integration.</li>
</ul>
<hr>
<p><em>What's your experience with Polars? Are you evaluating it, migrating to it, or already using it in production? We'd be interested to hear - share your thoughts in the comments below.</em></p>
<aside class="series-toc" aria-label="Series table of contents">
    <div class="series-toc__header">
        <h3 class="series-toc__title">Adventures in Polars</h3>
        <span class="series-toc__count">4 parts</span>
    </div>
    <ol class="series-toc__list">
            <li class="series-toc__item series-toc__item--current" aria-current="page">
                <span class="series-toc__part-number">1.</span>
                <span class="series-toc__part-title">Why Polars Matters</span>
                <span class="series-toc__current-label">(you are here)</span>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/under-the-hood-what-makes-polars-so-scalable-and-fast" class="series-toc__link">
                    <span class="series-toc__part-number">2.</span>
                    <span class="series-toc__part-title">What Makes Polars So Scalable and Fast?</span>
                </a>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/practical-polars-code-examples-everyday-data-tasks" class="series-toc__link">
                    <span class="series-toc__part-number">3.</span>
                    <span class="series-toc__part-title">Code Examples</span>
                </a>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/polars-workloads-on-microsoft-fabric" class="series-toc__link">
                    <span class="series-toc__part-number">4.</span>
                    <span class="series-toc__part-title">Polars Workloads on Fabric</span>
                </a>
            </li>
    </ol>
</aside>

<div class="footnotes">
<hr>
<ol>
<li id="fn:1">
<p>Quote from the talk <a href="https://youtu.be/UwRlFtSd_-8">"What Polars does for you", presented at the EuroPython Conference in 2023</a>.<a href="https://endjin.com/blog/polars-faster-pipelines-simpler-infrastructure-happier-engineers#fnref:1" class="footnote-back-ref">↩</a></p>
</li>
<li id="fn:2">
<p>Quote from the podcast episode <a href="https://youtu.be/ubqF0yGyphU">"827: Polars: Past, Present and Future - with Polars Creator Ritchie Vink" published on the Super Data Science: ML &amp; AI Podcast with Jon Krohn.</a><a href="https://endjin.com/blog/polars-faster-pipelines-simpler-infrastructure-happier-engineers#fnref:2" class="footnote-back-ref">↩</a></p>
</li>
</ol>
</div>]]></content:encoded>
    </item>
    <item>
      <title>Reactive Extensions for .NET Status and Plans for .NET 10</title>
      <description>&lt;p&gt;&lt;a href="https://endjin.com/who-we-are/our-people/ian-griffiths/"&gt;Ian Griffiths&lt;/a&gt;, Technical Fellow at endjin, shares the latest updates on the &lt;a href="https://github.com/dotnet/reactive"&gt;Reactive Extensions for .NET&lt;/a&gt; (AKA ReactiveX AKA Rx.NET). Learn about the new features in Rx 6.1, what .NET 10 means for the project, and the significant packaging changes coming in Rx v7.0 that finally solve the long-standing deployment bloat issue.&lt;/p&gt;
&lt;p&gt;In this talk:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Rx 6.1 New Features — DisposeWith operator for fluent CompositeDisposable usage, new TakeUntil overload with cancellation token support, and ResetExceptionDispatchState operator&lt;/li&gt;
&lt;li&gt;The Bloat Problem Explained — Why self-contained Windows deployments were pulling in 90MB of unnecessary WPF and Windows Forms assemblies&lt;/li&gt;
&lt;li&gt;Rx 7 Preview — How the new packaging model fixes bloat while maintaining source and binary compatibility&lt;/li&gt;
&lt;li&gt;Community Contributions — Features from Chris Pullman (ReactiveUI), Neils Berger, Daniel Weber, and Adam Jones&lt;/li&gt;
&lt;li&gt;Async Rx .NET — Status update and plans for a non-alpha release&lt;/li&gt;
&lt;/ul&gt;</description>
      <link>https://endjin.com/what-we-think/talks/reactive-extensions-for-dotnet-status-and-plans-for-dotnet-10</link>
      <guid isPermaLink="true">https://endjin.com/what-we-think/talks/reactive-extensions-for-dotnet-status-and-plans-for-dotnet-10</guid>
      <pubDate>Wed, 24 Dec 2025 18:30:00 GMT</pubDate>
      <category>Rx.NET</category>
      <category>ReactiveX</category>
      <category>Reactive Extensions for .NET</category>
      <category>Reactive Programming</category>
      <category>dotnet</category>
      <category>dotnetconf</category>
      <category>Open Source</category>
      <category>.NET Foundation</category>
      <category>.NET 10</category>
      <category>Talk</category>
      <enclosure length="0" type="image/jpeg" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/talks/dotnetconf-2025-rx-dotnet-status-and-plans.jpg" />
      <dc:creator>Ian Griffiths</dc:creator>
      <content:encoded><![CDATA[<p><a href="https://endjin.com/who-we-are/our-people/ian-griffiths/">Ian Griffiths</a>, Technical Fellow at endjin, shares the latest updates on the <a href="https://github.com/dotnet/reactive">Reactive Extensions for .NET</a> (AKA ReactiveX AKA Rx.NET). Learn about the new features in Rx 6.1, what .NET 10 means for the project, and the significant packaging changes coming in Rx v7.0 that finally solve the long-standing deployment bloat issue.</p>
<p>In this talk:</p>
<ul>
<li>Rx 6.1 New Features — DisposeWith operator for fluent CompositeDisposable usage, new TakeUntil overload with cancellation token support, and ResetExceptionDispatchState operator</li>
<li>The Bloat Problem Explained — Why self-contained Windows deployments were pulling in 90MB of unnecessary WPF and Windows Forms assemblies</li>
<li>Rx 7 Preview — How the new packaging model fixes bloat while maintaining source and binary compatibility</li>
<li>Community Contributions — Features from Chris Pullman (ReactiveUI), Neils Berger, Daniel Weber, and Adam Jones</li>
<li>Async Rx .NET — Status update and plans for a non-alpha release</li>
</ul>
<p><a href="https://www.youtube.com/watch?v=y7Ks_bwSHUg"><img src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/talks/dotnetconf-2025-rx-dotnet-status-and-plans.jpg"></a></p><p>Hi! Thanks for listening to this talk about what's been happening lately with the Reactive Extensions for .NET and what we've got planned. My name's Ian Griffiths. I'm a technical fellow at endjin, and my employer endjin maintains the Reactive Extensions for .NET. You can find the source code in this repo on GitHub.</p>
<p>If you're listening to this talk, it's likely that you already know about the Reactive Extensions, or as we usually call them, just Rx. But just in case you don't, here's a very quick introduction. Rx is an event-driven programming model. It's useful in any application where things happen. It provides a functional declarative programming style for writing code that responds to events.</p>
<p>This model has become popular in other programming languages, especially JavaScript, but it was originally a .NET technology invented by Microsoft. Rx .NET was one of the first projects that Microsoft open sourced, one of the first to come under .NET Foundation ownership, and it's been a community-supported project for over a decade now.</p>
<p>Today I'm gonna talk about things that have been, things that are, and some things that have not yet come to pass. More specifically, I'll talk about Rx 6.1, our most recent release. I'll also talk about what .NET 10 means for Rx and also for the related Ix project that lives in the same repository.</p>
<p>There's a new feature in the .NET 10 runtime class libraries that has a significant impact on us. And finally, I'll talk about our progress towards the next release, Rx version 7.0. To put this all in context, it's useful to know the recent history of Rx. endjin took over maintenance at the start of 2023.</p>
<p>This was a little over two years after the last release, Rx 5. The project had ground to a halt because its previous maintainers were no longer able to devote much time to it. Our first job was to bring the codebase back into line with current tooling, 'cause it had fallen behind and wasn't actually able to build on the current version of Visual Studio.</p>
<p>So we addressed that and then went on to add tests for newer versions of .NET, and then produced a new release, Rx version six. We spent a big chunk of the next year bringing the documentation up to date in the form of the free online ebook, Intro to Rx. We also spent a lot of time working out how best to solve a problem that I'll be talking about later, because that work will not ship until Rx 7.0.</p>
<p>So there wasn't a lot of visible progress. We had a couple of minor bug fix releases, but it wasn't until last month that we produced a release with any new features. And that's what I want to talk about first: Rx version 6.1, which shipped in October 2025. We've bumped the minor version number because there are three new features in Rx 6.1.</p>
<p>All of these arose from community input. Two were written by community members, and the third was based on community suggestions. So we have a new DisposeWith operator, which enables a fluent programming style when working with the CompositeDisposable type. This comes from Chris Pullman, a major contributor to the ReactiveUI project, and this extension fits in well with common coding idioms in that world.</p>
<p>To show how this works, I've got a program here based on one of the ReactiveUI examples. It's a very simple front end, and I can type search terms in here and it finds packages for me. This happens to be a WPF app, and as you can see, the main window here has this text box for me to type into and a list box that shows the result.</p>
<p>The basic idea with ReactiveUI is that we can represent user input as Rx observable streams, and we can also direct the output of observable streams into user interface elements. This code here is the basic logic for handling search input, and it's mostly standard Rx operators. This first method is from ReactiveUI and it lets me get an observable stream representing input to my textbox.</p>
<p>We're then using Rx's Throttle to ensure we don't perform searches too often while the user types, and this Where clause filters out empty inputs. And ultimately this delivers results into another property. And so that's one of the basic ideas of the ReactiveUI library. It uses Rx to define how information flows through the application.</p>
<p>Now if we look at the actual main window, this is where the code that you just saw gets hooked up to the user interface elements. You can see that this connects the application logic's search term input to the actual text box, and this connects the search logic's output to an actual list box. And this code here is where the new feature that we've added in Rx 6.1 comes in useful.</p>
<p>UI elements get created and torn down all the time. Entries in the search results, for example, appear when I type things and get replaced when I type something else. So each time some new UI opens up, not only do we need to run this sort of code to connect everything up, we also need to be ready to shut it all down cleanly. To enable that, this WhenActivated method passes me this argument, which uses Rx's CompositeDisposable type. Now that's been in Rx for years and it's just a collection of IDisposable objects, the idea being that they can all be disposed at once when needed. Any disposable objects I put in here will automatically be disposed when this user interface element goes away.</p>
<p>Right now I'm just calling Add to add things to that CompositeDisposable, and that's okay, but it's not quite the normal style for a ReactiveUI app. Normally we chain method calls together one after another in what's often called the fluent style. And if I show you another UI element, the one for an individual search result, you can see it's using that CompositeDisposable slightly differently.</p>
<p>Instead of wrapping each of these setup lines in a call to Add, I've got just one more fluent invocation on the end of it with this DisposeWith method. It's a small change, but it enables teardown to be handled slightly more neatly. And that's Rx 6.1's new DisposeWith feature. We also have a new overload of the TakeUntil operator.</p>
<p>This comes from Neils Berger and incorporates feedback from Daniel Weber, both members of the Rx community. Existing overloads enable a sequence to be observed until either an element matches some criteria or some other observable source completes. But this now enables a cancellation token to signal the instant at which the sequence should complete.</p>
<p>Finally, we have a new operator called ResetExceptionDispatchState, developed in response to feedback from Adam Jones, and this one is best shown by example. Rx has always offered this Throw operator. When you subscribe to the observable it returns, it immediately calls OnError, passing this exception. Since we construct just a single exception, it will reuse it for each subscription.</p>
<p>Normally that's not a problem, but there's one situation in which this can produce surprising behavior. But before I show you that situation, I want to demonstrate something about exceptions that has nothing to do with Rx and which you might not be aware of. I've got a different program here that creates a single exception object, and then each time around this loop, we wrap that in a task and await the task. That await will, of course, throw the exception, and this C# catch block just displays it. But watch what happens when I run this. Each time around the loop, my stack trace gets longer and longer. This is a feature of how the .NET runtime throws asynchronous exceptions. Each throw appends the current location to the stack trace, and normally that's what we want.</p>
<p>If an exception has traveled through multiple await statements before reaching a catch block, we want the stack trace to reflect that whole history, which is why the asynchronous rethrow appends new information to the existing stack trace. But this mechanism assumes that when the error occurred, something did actually use a normal throw operation.</p>
<p>And finally, there are a few changes that are minor, but which are technically breaking changes that have been waiting for the next major release. So we made a small change to nullability handling of the OfType operator to align it with some other LINQ implementations. And there are some behaviors that we consider to be unintended and where we think the fix will align with how people expect things to work, but technically it's a breaking change. And finally, we're gonna stop producing new versions of the old compatibility facades that were introduced in Rx version four.</p>
<p>But the big one is the fix for the bloat issues. We are going to introduce a significant change in Rx .NET packaging. It will only affect people building UI applications on Windows, though. I've got a console application with a Windows-specific target framework. It needs that because it invokes certain Windows APIs. The Program.cs here, I'm using the network availability API provided by WinRT, and I've wrapped it using Rx to provide a stream of notifications when the computer loses or acquires network connections. That's a Windows-specific API, but this is just a console app.</p>
<p>So this illustrates that just because you specified a Windows target framework, it doesn't necessarily mean you are building a classic desktop application with a user interface. But look at this build output folder. Now, I've configured this project to use self-contained deployment, meaning that it brings its own copy of the .NET runtime and any other components that it needs to run.</p>
<p>This application can be installed by simply copying this entire folder to the target machine. It does not need the .NET runtime to be pre-installed, but it's huge. Now part of that is simply that the .NET runtime is quite large, but that's not the whole story. This is much bigger than a self-contained console app would normally be.</p>
<p>Looking more closely, we see these PresentationFramework components. These are part of WPF, one of the Windows desktop frameworks that .NET offers. And a little further down, we can see that the other one, Windows Forms, is here too. These two UI frameworks are adding about 90 megabytes to the size of this folder.</p>
<p>We can mitigate this a little by enabling trimming. That will make the whole folder a great deal smaller, but even so, it ends up being many megabytes larger than it would've been if these UI frameworks weren't here. So why are they here? It's because of the unfortunate consequences of a design decision made back in Rx version four.</p>
<p>This decision is known as "the great unification," and it took us through a world in which Rx consisted of multiple components in which it was kind of unclear what they all did, and even more unclear which ones were required in which situations, and it took us into a world where there's exactly one component: System.Reactive. And at the time, this was great, but the decision to include UI framework support as part of this great unification has turned out to be problematic. The effect is that if you build an application with a target in which Windows Forms and WPF are available, Rx will provide its Windows Forms and WPF support, and that's a problem because to do that, it adds an implicit reference to the desktop UI framework, meaning that self-contained deployments get a copy of these UI frameworks even if the application itself isn't actually using them.</p>
<p>I'm now gonna upgrade this project to a preview build of Rx version 7.0 and rebuild it. Now the project appears to work exactly as before, and our goal is that for anyone who wasn't building WPF, Windows Forms, or UWP applications, that it all carries on working exactly as before.</p>
<p>So if you weren't running into this bloat issue, or perhaps because you weren't using self-contained deployment, then this shouldn't really affect you. But this example does use it, so let's look at the output folder. This is now much smaller. This is almost exactly the same size as a simple Hello World application, made slightly larger just by the presence of the System.Reactive assembly, but those PresentationFramework and Windows Forms assemblies are gone.</p>
<p>So you can see that the main effect has been achieved, but this has consequences for applications that actually are using the UI framework features. So here I've got a WPF project that is using Rx six, and this line of code here is using a WPF-specific feature. Specifically, this ensures that any events that emerge from this observable are delivered via the correct thread for this window.</p>
<p>I'm now gonna try upgrading this to Rx 7.0.</p>
<p>Now I've got compiler errors. To enable Windows projects to use Rx without being forced to have a dependency on the desktop UI frameworks, we've had to hide the relevant types. They are still actually present at runtime to ensure binary compatibility, but they are effectively invisible at build time. I could spend well over an hour explaining why this was the least bad available solution to the problem, but we don't have time for that in this particular talk. Anyway, notice that in addition to the error, we've got this diagnostic. Now we've added a code analyzer to Rx 7.0 that detects this exact situation and tells you what you need to do to fix it.</p>
<p>We've done this because we know people will get this error when they upgrade and it isn't entirely obvious what to do about it. So the analyzer tells me that I need to add a reference to the new System.Reactive.WPF NuGet package. So let's go back and find that. And when I add this, I'm back in business.</p>
<p>So this provides source-level compatibility as long as you add the new package reference. And as I mentioned, we do actually provide binary-level compatibility by using a trick with reference assemblies to hide these types just at build time. So the need to add a new package is slightly annoying, but our view is that UI framework-specific support should really always have been an opt-in feature.</p>
<p>So this gets us finally to the place where we want to be. Now you might be wondering why it's taken so long to get here. I've been talking about this for two years now. We first opened a GitHub discussion on this back in November 2023, and we did produce a prototype just a few months later to show what it would look like in the hope of getting some feedback.</p>
<p>Now, we didn't get a lot, so we announced in October 2024 that we were gonna move forward, but that did produce some negative community feedback. The plan we had at the time was somewhat more radical and would've been more disruptive. Its end state would've avoided the weird trick we've had to use with reference assemblies.</p>
<p>And it would also have enabled us finally to remove the UAP target framework from the main Rx package. But it would also have created a lot of problems for people who weren't actually affected by the issue we were trying to fix. So we had a rethink and came up with a new plan. The critical difference is that now System.Reactive remains as the main assembly, meaning that a lot of Rx users shouldn't even notice this change.</p>
<p>We also introduced an extensive set of tests designed to find the kind of subtle problems that emerge with any attempt to fix this issue. We called that test suite "Rx Gauntlet," and that is what's given us the confidence to move forward with this new plan.</p>
<p>One last thing: last year I talked about Async Rx .NET. We have made a little progress on extending the test suite, but we considered the code bloat issue with Rx .NET to be of higher priority. We will be getting back to Async Rx .NET, and we hope to have a non-alpha release next year.</p>
<p>My name's Ian Griffiths. Thanks for listening.</p>]]></content:encoded>
    </item>
    <item>
      <title>C# 14 New Feature: Implicit Span Conversions</title>
      <description>&lt;p&gt;In this talk, &lt;a href="https://endjin.com/who-we-are/our-people/ian-griffiths/"&gt;Ian Griffiths&lt;/a&gt; dives into the new implicit conversions introduced in C# 14, designed to make span types more natural to use.&lt;/p&gt;
&lt;p&gt;He discusses how this change enhances performance, simplifies method signatures, and enables more powerful extension methods. However, Ian also warns about potential compatibility issues with older libraries and provides advice for library authors. He concludes with technical examples and solutions to common problems caused by these new features. Essential viewing for C# developers looking to leverage spans in their code more effectively.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;00:00 Introduction to Implicit Conversions in C# 14&lt;/li&gt;
&lt;li&gt;01:17 Understanding Span Types and Their Benefits&lt;/li&gt;
&lt;li&gt;02:35 Practical Examples of Using Spans&lt;/li&gt;
&lt;li&gt;05:04 Limitations and Issues in C# 13&lt;/li&gt;
&lt;li&gt;08:05 Improvements in C# 14&lt;/li&gt;
&lt;li&gt;12:42 New Implicit Span Conversions&lt;/li&gt;
&lt;li&gt;19:16 Potential Issues with Older Libraries&lt;/li&gt;
&lt;li&gt;23:33 Conclusion and Final Thoughts&lt;/li&gt;
&lt;/ul&gt;</description>
      <link>https://endjin.com/what-we-think/talks/csharp-14-new-feature-implicit-span-conversions</link>
      <guid isPermaLink="true">https://endjin.com/what-we-think/talks/csharp-14-new-feature-implicit-span-conversions</guid>
      <pubDate>Thu, 11 Dec 2025 06:30:00 GMT</pubDate>
      <category>.NET 10</category>
      <category>C# 14</category>
      <category>dotnet</category>
      <category>Implicit Span Conversions</category>
      <category>Open Source</category>
      <category>.NET Foundation</category>
      <category>Talk</category>
      <enclosure length="0" type="image/jpeg" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/talks/csharp-14-new-feature-implicit-span-conversions.jpg" />
      <dc:creator>Ian Griffiths</dc:creator>
      <content:encoded><![CDATA[<p>In this talk, <a href="https://endjin.com/who-we-are/our-people/ian-griffiths/">Ian Griffiths</a> dives into the new implicit conversions introduced in C# 14, designed to make span types more natural to use.</p>
<p>He discusses how this change enhances performance, simplifies method signatures, and enables more powerful extension methods. However, Ian also warns about potential compatibility issues with older libraries and provides advice for library authors. He concludes with technical examples and solutions to common problems caused by these new features. Essential viewing for C# developers looking to leverage spans in their code more effectively.</p>
<ul>
<li>00:00 Introduction to Implicit Conversions in C# 14</li>
<li>01:17 Understanding Span Types and Their Benefits</li>
<li>02:35 Practical Examples of Using Spans</li>
<li>05:04 Limitations and Issues in C# 13</li>
<li>08:05 Improvements in C# 14</li>
<li>12:42 New Implicit Span Conversions</li>
<li>19:16 Potential Issues with Older Libraries</li>
<li>23:33 Conclusion and Final Thoughts</li>
</ul>
<p><a href="https://www.youtube.com/watch?v=C8yGHfk-puA"><img src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/talks/csharp-14-new-feature-implicit-span-conversions.jpg"></a></p><p>C# 14 defines some new implicit conversions that make the use of span types feel more natural. In fact, the feature proposal's original title was "First Class Span Types." So this new feature is a change that should mostly go unnoticed. Things should just work the way you would expect. However, there are a couple of good reasons to know about the details.</p>
<p>One is that if you are using class libraries that were designed before C# 14, this change might cause problems. There are some circumstances in which code that used to compile just fine needs modification in C# 14 because you're using a library that made some assumptions that are no longer true. The resulting errors can be baffling if you don't understand the language changes that cause them.</p>
<p>Another good reason to understand these changes in detail is if you're a library author, because this new feature might change certain design decisions. And there's another reason, which is more indirect, but was actually the first thing that caused a problem in my work: when library designers change their libraries to adapt to this new feature, those changes can cause errors when you recompile your own code. But we'll get to that.</p>
<p>Just in case you've not come across .NET span types, they were introduced back in 2017 with C# 7.2 and .NET Core 2.1, with two main goals. One was to make it much easier to write high-performance code. These types are what enable high-performance libraries like System.Text.Json to be so much faster and more efficient than older libraries. At Endjin, we wrote the AIS.NET libraries using spans, and they delivered performance several orders of magnitude better than other libraries available at the time.</p>
<p>The other thing that spans do is make it easier to deal with all the different places data might live. An ordinary array object has to live on .NET's garbage-collected heap, but .NET provides ways to create arrays that live inside the call stack. And if we're interacting with operating system APIs or external libraries, we might want to use data that lives in regions of memory that are outside of .NET's control.</p>
<p>A span can handle all three scenarios. This means that we can write a single method that can work with data wherever it happens to be. Whether the data is on the .NET heap, the stack, or some other block of memory, a single method that takes a span can deal with all of these.</p>
<p>Here's an example. To keep it simple, this reports just the length of the data. We could do more interesting things, but this new language feature mainly affects how methods are invoked, so it's the method signature that matters here.</p>
<p>This line makes a span for some data that lives on the stack. It builds an array without creating work for the garbage collector. And as this shows, I can also get a span for a string.</p>
<p>A crucial feature of spans is that they don't copy data. They refer to the data wherever it may happen to sit. So this span does actually refer to the garbage-collected heap because I started with a string, and .NET string objects always live on the heap. But this line here shows one of the efficiency features of a span.</p>
<p>I can slice it. Here I said I want a span that refers to just the first five characters of the string. So conceptually, this is just like calling Substring, but the big advantage a span offers here is that it doesn't create a copy. Whereas the string type's Substring method creates a whole new string object on the heap containing a copy of the part you wanted, this refers to exactly the same string data as the previous span. It just remembers that we're only looking at the first five characters.</p>
<p>And this shows that if I've got an ordinary array on the heap, I can turn that into a span too. Again, this doesn't copy it; it just lets me refer to the data that's in an ordinary array using a span.</p>
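<p>To make that concrete, here is a rough sketch of the kind of code being described (the on-screen code isn't reproduced in this transcript, so the method and variable names here are assumptions):</p>
<pre><code class="language-csharp">// A single method that works with data wherever it lives; "ShowLength" is an assumed name.
static void ShowLength&lt;T&gt;(ReadOnlySpan&lt;T&gt; data) =&gt; Console.WriteLine(data.Length);

// Data on the stack: building this creates no work for the garbage collector.
ReadOnlySpan&lt;int&gt; onStack = stackalloc int[] { 1, 2, 3 };
ShowLength(onStack);

// A span over a string. The characters stay where they are on the heap; nothing is copied.
ReadOnlySpan&lt;char&gt; chars = "Hello, world".AsSpan();
ShowLength(chars);

// Slicing: a view over just the first five characters of the same string data.
// Unlike string.Substring, this does not allocate a new string.
ShowLength(chars[..5]);

// An ordinary heap-allocated array can be viewed as a span too, again without copying.
int[] numbers = { 1, 2, 3, 4, 5 };
ReadOnlySpan&lt;int&gt; fromArray = numbers;
ShowLength(fromArray);
</code></pre>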
<p>This flexibility means that a single method that takes a span will work for strings, arrays, or data that lives outside of the heap, and it will support efficient slicing of data. If I had written this method to take an array, it wouldn't work on strings or on stack-based data. And if I wanted to slice the input up, I'd need to allocate a new smaller array with a copy of the data. So it often makes more sense to define APIs that take spans instead of ones that take arrays or strings.</p>
<p>It's also quite common to define extension methods for spans. So I could, if I wanted, implement this functionality as an extension method instead. As you can see here—now, you may have deduced from the file names that this is all C# 13. So if I can do all this in C# 13, why did C# 14 need to change anything?</p>
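<p>For reference, the extension-method form just described might look like this (the talk later refers to it as ShowLengthExt; the containing class name is an assumption), still compiling as C# 13:</p>
<pre><code class="language-csharp">internal static class SpanExtensions
{
    // Extension-method equivalent of ShowLength, intended to be called as someSpan.ShowLengthExt().
    public static void ShowLengthExt&lt;T&gt;(this ReadOnlySpan&lt;T&gt; data)
        =&gt; Console.WriteLine(data.Length);
}
</code></pre>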
<p>So now it's time to look at some things that don't work in C# 13. When I passed an array to this first method, I first assigned it into a variable of type ReadOnlySpan of int, which shows us that an implicit conversion is available. So you might think I can just pass the array directly to the method, but I got an error.</p>
<p>It turns out that the compiler can't infer the right type argument. If I provide the type argument, then it works. So it's possible, but a bit inconvenient. And if I try to pass a string directly, it's the same story. A suitable implicit conversion exists, but I need to provide the type argument explicitly for the compiler to understand what I want.</p>
<p>What about the extension method? When I tried that, it doesn't work. The compiler error tells us that it's looking for a method accepting a first argument of type int array. In fact, the method will accept that if I call it explicitly, instead of as an extension method.</p>
<p>Again, I need to provide the type argument. So maybe that was the problem with the extension method, but if I supply the type argument while trying to use this as an extension method, it still doesn't work. So it has actually found the extension method, but it's telling me that for this code to work, the extension method requires a receiver of type array.</p>
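<p>Pulling those cases together, the C# 13 behaviour being described looks roughly like this (the diagnostics are paraphrased, and the names reuse the earlier sketch):</p>
<pre><code class="language-csharp">int[] numbers = { 1, 2, 3, 4, 5 };

ReadOnlySpan&lt;int&gt; asSpan = numbers;          // OK: an implicit conversion exists
ShowLength(asSpan);                          // OK

ShowLength(numbers);                         // C# 13: error, the type argument cannot be inferred
ShowLength&lt;int&gt;(numbers);                    // OK once the type argument is supplied

ShowLength("Hello");                         // C# 13: error, same story for strings
ShowLength&lt;char&gt;("Hello");                   // OK with an explicit type argument

numbers.ShowLengthExt();                     // C# 13: error, no ShowLengthExt taking an int[] receiver
numbers.ShowLengthExt&lt;int&gt;();                // C# 13: still an error; extension receivers don't get
                                             // user-defined implicit conversions
SpanExtensions.ShowLengthExt&lt;int&gt;(numbers);  // OK when called as an ordinary static method
</code></pre>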
<p>And this is true. The rules in C# for when a method can be used as an extension method are slightly more restrictive than the rules for normal method invocation. When you use normal method invocation, then if the argument types don't match, the compiler will look for user-defined implicit conversions, and that's what's happening here.</p>
<p>The span type defines a custom implicit conversion from arrays, and the compiler has used that to make this method call work. And if I open up the compiled code in ILSpy, and if I find the main method—that call to ShowLengthExt is the last thing I did—and if I scroll down to the end, you can see that it is indeed invoking the span-defined custom implicit conversion operator.</p>
<p>So why doesn't it do the same thing when I try to invoke the same method with the extension method syntax? Well, I said the rules for extension methods are more restrictive, and one of the differences is that C# won't look for user-defined implicit conversion operators to try to make an extension method invocation work.</p>
<p>So how does C# 14 change all of this? I'll keep the C# 13 version open on the left so you can see how things change in C# 14. Let's start with the first method, so not the extension method. If I'm using a stack-allocated array, I now don't need to assign that into a span variable first. I can just pass it directly.</p>
<p>This code only works in C# 14, although in practice it's not a big win, because there's a simpler way to achieve this. Visual Studio suggests that I use the collection expression syntax, which is a more compact way to do exactly the same thing as this code. And if I were to accept that change, the result would actually work in C# 13 as well.</p>
<p>So although the new language feature happens to enable this example, it's only interesting insomuch as it helps us understand what's changed. Now, although C# 14 does define a new implicit conversion from string to span, it doesn't actually affect the particular example I wrote here. I'll show the impact later, but it doesn't change this bit here.</p>
<p>But there is a useful improvement with arrays. In the C# 14 example, I've removed the span variable because now when I pass an array to a generic method that takes a span, I no longer need to specify the type argument to make it work.</p>
<p>But the main goal with this language change is actually to make extension methods work better. So now I can use the extension method directly on the array, and that's the most important thing that these language changes enable.</p>
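<p>With the same ShowLength and ShowLengthExt definitions as in the sketches above, the C# 14 version of those call sites becomes:</p>
<pre><code class="language-csharp">int[] numbers = { 1, 2, 3, 4, 5 };

ShowLength(stackalloc int[] { 1, 2, 3 });  // C# 14: a stack-allocated buffer can be passed directly
                                           // (per the talk, ShowLength([1, 2, 3]) works even in C# 13)
ShowLength(numbers);                       // C# 14: the type argument is now inferred for arrays
numbers.ShowLengthExt();                   // C# 14: the extension method now works on an array receiver
</code></pre>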
<p>So what's actually changed? Well, what they didn't do was just allow extension method resolution to apply custom implicit conversions to the receiver. That might seem like it would've been the obvious thing to do if your starting point is this code—I've declared read-only local variables to make use of the custom implicit conversions that ReadOnlySpan defines.</p>
<p>So it's tempting to say that I want the compiler to just do that work for me and use these same conversions automatically for extension methods. But this would be a quite broad change with the potential to change the meaning of a lot of existing code. If the compiler suddenly starts considering using implicit conversions in cases where it never used to, previously correct code might become ambiguous, causing compiler errors.</p>
<p>It may well be possible to create rules that would enable this to work while avoiding such problems. But the C# team's goal here wasn't to enable a whole new way of using extension methods. Their goal was much more focused. They wanted to make it easier to use spans. Again, the feature proposal's original title was "First Class Span Types."</p>
<p>So although the actual mechanisms are some changes to the type conversion rules, the intention is absolutely focused on spans. This feels like a slightly unusual move to me. Normally, C# has tended to prefer generality. For example, the motivation for LINQ was better language-level support for working with databases, but the set of language features that enabled LINQ are all far more general in nature.</p>
<p>So it surprised me a little to see the C# designers choose a narrower mechanism when a more general one might have solved the same problem. Now having read through all the public discussions that I could find about this feature, it looks to me like the C# team wanted the language to embody the quite specific relationship between arrays and spans, and that they concluded that general-purpose features would either be unable to capture the nuances or would be excessively complex.</p>
<p>And I think that's why they didn't simply allow implicit conversions to apply to extension method receivers. That would be a very general mechanism, but it provides nowhere to build in any special understanding of how spans and arrays relate. That said, I still don't have a clear idea of any specific scenarios that would've turned out worse if they'd done the more general thing.</p>
<p>In the public discussions I found where people asked if they couldn't just make user-defined implicit conversions available for extension method receivers, the C# team's responses stated the design philosophy without giving specific examples of why it was better. But anyway, to make this work, they've added a completely new kind of built-in implicit conversion to the language specification.</p>
<p>In C# 14, we now have something called an implicit span conversion. The rules for implicit span conversions directly reflect the relationship between arrays and spans. So specifically we have conversions to Span of T for any array of type T. We also have conversion to ReadOnlySpan of T, but this goes further. This supports covariant conversion.</p>
<p>So for example, if you've got an array of strings, not only is that convertible to a ReadOnlySpan of string, it's also convertible to a ReadOnlySpan of object because there's an implicit reference conversion from string to object. Also, since string implements IComparable, this new language feature also makes an implicit conversion from array of string to ReadOnlySpan of IComparable, and likewise for any other interface that string implements.</p>
<p>This new language feature also defines covariant conversion from Span to ReadOnlySpan. These covariant conversions are examples of how this language feature embeds the similarities between spans and arrays.</p>
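<p>Written as assignments, the conversions described so far look like this (all of these compile in C# 14):</p>
<pre><code class="language-csharp">string[] strings = { "a", "b", "c" };

Span&lt;string&gt; span = strings;                       // array of T to Span&lt;T&gt;
ReadOnlySpan&lt;string&gt; readOnly = strings;           // array of T to ReadOnlySpan&lt;T&gt;
ReadOnlySpan&lt;object&gt; asObjects = strings;          // covariant: string[] to ReadOnlySpan&lt;object&gt;
ReadOnlySpan&lt;IComparable&gt; asComparables = strings; // covariant: via an interface that string implements
ReadOnlySpan&lt;string&gt; fromSpan = span;              // Span&lt;T&gt; to ReadOnlySpan&lt;T&gt;
</code></pre>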
<p>The final new implicit conversion is from string to ReadOnlySpan of char. Now, if you're familiar with spans, you might be perplexed at this stage because the runtime library span types already define implicit conversion operators for some of these conversions.</p>
<p>And it's true that some of the cases where these new conversions apply used to work in C# 13 without any special compiler handling. The runtime library defines the Span and ReadOnlySpan types with various user-defined implicit conversions, and the compiler applies these in exactly the same way as it handles custom implicit conversions for other types.</p>
<p>So what then is the point of these language changes? The key is that these implicit conversions are now a different kind from ordinary custom implicit conversions, meaning the language can define special rules for them. For example, this new feature adds some new type inference rules specifically for the case where an array is supplied where either a Span or ReadOnlySpan is expected.</p>
<p>This is why in C# 14 we no longer need to provide a type argument when passing an array to a method that accepts a span. In C# 13, the only way I could pass an array expression to my span-based method was to supply the type argument. The type inference rules aren't able to determine that a user-defined conversion could work here.</p>
<p>But in C# 14, there's an explicit inference rule for when an array is passed where a span is expected. So when I do that here, it correctly infers that the type argument should be the array element type, and it can then apply the built-in conversion from array to ReadOnlySpan.</p>
<p>This new language feature also changes the rules for extension methods. These new implicit span conversions are now in the list of conversions that the compiler will consider when trying to decide whether an extension method is applicable. That's why I can call my span-based extension method directly on the array in C# 14. The new rules mean that when trying to resolve this method, the compiler will consider the new implicit conversions from arrays to spans, which means that my ShowLengthExt method is now a candidate, which it wasn't in C# 13.</p>
<p>So we have these new built-in implicit conversions and some changes to the rules for extension method resolution. This enables the most important feature: extension methods for spans now work like you'd hoped they would with arrays.</p>
<p>But there's more. As well as affecting method resolution, the changes also open up some new scenarios around user-defined conversions. As you may know, C# won't chain together an unlimited number of user-defined implicit conversions. If you define a type called Stringable with an implicit conversion to string, the compiler will let you assign a Stringable into a variable of type string, or pass it as an argument to a method accepting a string.</p>
<p>Now, as I already mentioned, it won't use this implicit conversion for the implicit receiver argument of an extension method, but user-defined implicit conversions are available for ordinary arguments. But if we then define another type that has an implicit conversion to Stringable, which I suppose we would call Stringableable, then although we can assign a Stringableable directly into a variable of type Stringable, and we can assign a Stringable into a variable of type string, we can't assign a Stringableable directly into a string.</p>
<p>The compiler won't string the conversions together, so to speak. Essentially you get to use just one user-defined conversion within an expression.</p>
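<p>A minimal sketch of that rule, using the type names from the talk:</p>
<pre><code class="language-csharp">class Stringable
{
    public static implicit operator string(Stringable s) =&gt; "I am stringable";
}

class Stringableable
{
    public static implicit operator Stringable(Stringableable s) =&gt; new Stringable();
}

Stringableable twoStepsAway = new();
Stringable oneStepAway = twoStepsAway;  // OK: one user-defined conversion
string text = oneStepAway;              // OK: one user-defined conversion
string direct = twoStepsAway;           // Error: the compiler won't chain two user-defined conversions
</code></pre>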
<p>Now, before C# 14, the implicit conversions available for spans were implemented as user-defined conversions in the usual way by the span types, and these effectively used up your quota.</p>
<p>Let me show you what I mean. This class defines a custom conversion from ReadOnlySpan of object. Here I've defined an object array. Now in C# 13, I can assign that into a ReadOnlySpan of object because ReadOnlySpan defines a user-defined implicit conversion operator, and then I can assign that span into a variable of type WithImplicit because of this conversion operator.</p>
<p>So we've effectively assigned an array of objects into a WithImplicit variable by going through a couple of implicit conversion operators. But if I try to go there directly, it doesn't work. The compiler won't discover the chain of custom conversion operators required to get there. Each assignment effectively has to use just one implicit conversion at a time.</p>
<p>But in C# 14, I can go straight from the array to my type, and that's because the conversion to span is now a built-in implicit conversion, meaning that I'm only attempting a single user-defined conversion in this expression.</p>
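<p>Here's a sketch of that example (the WithImplicit name comes from the talk; the rest is illustrative):</p>
<pre><code class="language-csharp">class WithImplicit
{
    public static implicit operator WithImplicit(ReadOnlySpan&lt;object&gt; items) =&gt; new WithImplicit();
}

object[] array = { 1, "two", 3.0 };

// C# 13: two steps, one user-defined conversion per assignment.
ReadOnlySpan&lt;object&gt; span = array;  // uses ReadOnlySpan's user-defined conversion from an array
WithImplicit viaSpan = span;        // uses WithImplicit's user-defined conversion from a span

// C# 13: error. C# 14: OK, because array-to-span is now a built-in conversion,
// leaving the single permitted user-defined conversion for WithImplicit itself.
WithImplicit direct = array;
</code></pre>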
<p>Earlier I mentioned that this change can cause problems when you use old libraries written before this feature was added. This can happen when the new built-in implicit conversions mean that method overloads that were not previously applicable now become candidates. Code that used to compile without problems can end up being ambiguous.</p>
<p>For example, xUnit.NET had a potential problem. Now, in fact, they fixed this before .NET 10 shipped. But if you're using older versions of their libraries, you could still encounter this.</p>
<p>So the problem was that the Assert.Equal method had these two overloads. Now suppose you wrote a test containing this code. In C# 13, this would use the first overload. But now in C# 14, the built-in implicit conversion from array to span means that these two methods are considered equally good candidates, and so the call is ambiguous.</p>
<p>As I said, they have already fixed this, but if you're a library author, how would you fix these kinds of problems in your own code? Well, if you've got a situation like the one just described, you've got a couple of options. You could define a new overload that the C# compiler will consider to be a better match than either of the existing ambiguous options.</p>
<p>For example, in the xUnit example I just described, one way to resolve the problem would be to add an overload that accepts a ReadOnlySpan as its first argument and an array as its second. This is a perfect match for the code I showed, so the compiler no longer considers using the other two options because those are both less direct.</p>
<p>Another possibility is to use the OverloadResolutionPriority attribute. You can put this on a method to break ties in cases that would normally be ambiguous.</p>
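<p>As a sketch of that mechanism only (these signatures are simplified illustrations, not the actual xUnit code or fix):</p>
<pre><code class="language-csharp">using System.Runtime.CompilerServices;

public static class MyAssert
{
    // If C# 14's new conversions make both overloads applicable for some call,
    // the priority attribute (introduced in .NET 9) breaks the tie in favour of this one.
    [OverloadResolutionPriority(1)]
    public static void Equal&lt;T&gt;(T[] expected, T[] actual) { /* ... */ }

    public static void Equal&lt;T&gt;(ReadOnlySpan&lt;T&gt; expected, ReadOnlySpan&lt;T&gt; actual) { /* ... */ }
}
</code></pre>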
<p>Now there's one more way in which this new language feature could cause you problems. And ironically, it occurs when a library has been modified specifically to provide better support for spans. Now, we at Endjin actually ran into this in the Reactive Extensions for .NET, an open-source project that Endjin maintains.</p>
<p>The .NET runtime libraries define two extension methods, both called Reverse. Now they're designed for different scenarios. One is a standard LINQ operator, and LINQ to Objects defines this for any type that implements IEnumerable of T. This is a non-destructive method. It doesn't change its target. Instead, it returns a brand new object which provides all the values from the source object but in reverse order. So it's just a view over the underlying source.</p>
<p>But the MemoryExtensions class also defines a method called Reverse as an extension method for Span of T. And this is an in-place operation that modifies its target.</p>
<p>Now a couple of unit tests in the Reactive Extensions code base included expressions like this. Now, our intention here was to get an IEnumerable of char that enumerates the characters in a string in reverse order. And before C# 14, this resolved unambiguously to the IEnumerable of T extension.</p>
<p>But remember, one of the main goals of this new language feature is to enable extension methods to work on expressions whose type is not a span but which can be implicitly converted to a span. So in this case, that makes the span flavour of Reverse a viable candidate.</p>
<p>Now, this is a somewhat unusual example because these aren't really overloads of the same thing. They're two completely different methods that happen to have the same name. So despite what duck-typing advocates would have you believe, names alone do not always fully identify methods.</p>
<p>Now it's easy to fix. We just insert a call to AsEnumerable to force the array into an IEnumerable of char and now the overload we want is an exact match, so the error goes away. This particular form of problem is likely to be fairly unusual because it's just not all that common to get this kind of name collision.</p>
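<p>Since the exact expression from those tests isn't shown here, this is a hedged reconstruction of the pattern and the fix:</p>
<pre><code class="language-csharp">using System.Collections.Generic;
using System.Linq;

char[] chars = "hello".ToCharArray();

// C# 13: binds to LINQ's Enumerable.Reverse, a lazy, non-destructive view over the characters.
// C# 14: the built-in array-to-span conversion also makes MemoryExtensions.Reverse
// (an in-place operation returning void) a candidate, so this line now produces a compile error.
IEnumerable&lt;char&gt; reversed = chars.Reverse();

// The fix: force the receiver to be an IEnumerable&lt;char&gt; so the LINQ operator is an exact match.
IEnumerable&lt;char&gt; reversedView = chars.AsEnumerable().Reverse();
</code></pre>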
<p>So there it is. Whereas we used to have to rely on the custom implicit conversion operators that the span types defined, the language now defines built-in implicit conversions. These are designed specifically to express the relationship between spans and arrays, providing more precisely targeted support than would otherwise be possible. And this also enables span-based extension methods to be more widely applicable.</p>
<p>My name's Ian Griffiths. Thanks for listening.</p>]]></content:encoded>
    </item>
    <item>
      <title>How .NET 10.0 boosted AIS.NET performance by 7%</title>
      <description>.NET 10.0 has shipped, and for the fifth year running, we benchmarked endjin's AIS.NET library and were very happy to see substantial performance gains, with no extra work required.</description>
      <link>https://endjin.com/blog/how-dotnet-10-boosted-ais-dotnet-performance-by-7-percent-for-free</link>
      <guid isPermaLink="true">https://endjin.com/blog/how-dotnet-10-boosted-ais-dotnet-performance-by-7-percent-for-free</guid>
      <pubDate>Tue, 09 Dec 2025 06:30:00 GMT</pubDate>
      <category>C# 14.0</category>
      <category>C# 14</category>
      <category>C#</category>
      <category>CSharp</category>
      <category>.NET 10.0</category>
      <category>dotnet</category>
      <category>C# Tutorials</category>
      <category>C# Programming</category>
      <category>High Performance</category>
      <category>Programming C# 12.0</category>
      <category>low-allocation</category>
      <category>ref struct</category>
      <category>Span&lt;T&gt;</category>
      <category>ReadOnlySpan&lt;T&gt;</category>
      <category>Ais.Net</category>
      <category>aisdotnet</category>
      <category>AIS</category>
      <category>Automatic identification System</category>
      <category>Parser</category>
      <category>Decoder</category>
      <category>Marine</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/12/aisdotnetperfnet10.0.png" />
      <dc:creator>Ian Griffiths</dc:creator>
      <content:encoded><![CDATA[<p>TLDR; .NET 10.0 increased performance in our <a href="https://github.com/ais-dotnet/Ais.Net/">Ais.NET library</a> by 7% with no code changes. Performance is well over twice as fast as it was on .NET Core 3.1 when we first released this library. A Surface Laptop Studio 2 can process 10.14 million messages per second!</p>
<p>At endjin, we maintain <a href="https://github.com/ais-dotnet/Ais.Net/">Ais.Net</a>, an open source <a href="https://endjin.com/blog/introducing-ais-dotnet-high-performance-ais-parsing-in-csharp">high-performance library for parsing AIS messages</a> (the radio messages that ships broadcast to report their location, speed, etc.). Each time a new version of .NET ships, we check it all still works and then run our benchmarks again. Each year, we've seen significant improvements:</p>
<ul>
<li><a href="https://endjin.com/blog/how-dotnet-6-boosted-ais-dotnet-performance-by-20-percent-for-free">.NET 6.0: 20%</a></li>
<li><a href="https://endjin.com/blog/how-dotnet-7-boosted-ais-dotnet-performance-by-19-percent-for-free">.NET 7.0: 19%</a></li>
<li><a href="https://endjin.com/blog/how-dotnet-8-boosted-ais-dotnet-performance-by-27-percent-for-free">.NET 8.0: 27%</a></li>
<li><a href="https://endjin.com/blog/how-dotnet-9-boosted-ais-dotnet-performance-by-9-percent-for-free">.NET 9.0: 9%</a></li>
</ul>
<p>So what about .NET 10.0? The short answer is that yet again it is significantly faster. For continuity I have run the benchmarks on the same desktop computer as when I first started benchmarking this library, meaning these figures are all directly comparable.</p>
<p>For the fifth year running, we're enjoying a free lunch! Without making any changes whatsoever to our code, our benchmarks improved by roughly 7% simply by running the code on .NET 10.0 instead of .NET 9.0. As with last time, we've not had to release a new version—the existing version published on NuGet (which targets <code>netstandard2.0</code> and <code>netstandard2.1</code>) runs faster just as a result of upgrading your application to .NET 10.0.</p>
<p>Admittedly, this year's improvement is the smallest yet. But if you had asked me back in 2019 when we first wrote this library whether I'd expect to see each subsequent release of .NET make the library run faster and faster, with the aggregate improvement making the library run over 2.1x faster, I would have been sceptical.</p>
<p>Our memory usage is roughly the same. Our amortized allocation cost per record continues to be 0 bytes. The total memory usage including startup costs is very similar: a handful of kilobytes, depending on exactly which features you use.</p>
<h2 id="benchmark-results">Benchmark results</h2>
<p>We have two benchmarks. One measures the maximum possible rate of processing messages, while doing as little work as possible for each message. This is not entirely realistic, but it's useful because it establishes the upper bound on how fast an application can process AIS messages on a single thread. The second benchmark uses a slightly more realistic workload, inspecting several properties from each message. Each benchmark runs against a file containing one million AIS records.</p>
<h3 id="net-8.0">.NET 8.0</h3>
<p>When I tested on <a href="https://endjin.com/blog/how-dotnet-8-boosted-ais-dotnet-performance-by-27-percent-for-free">.NET 8.0 in November 2023</a>, I saw the results shown in this next table when running the benchmarks on my desktop. These figures correspond to an upper bound of 5.72 million messages per second, and a processing rate of 4.75 million messages a second for the slightly more realistic example. (The desktop I've run all these benchmarks on is now about 8 years old, and it has an Intel Core i9-9900K CPU.)</p>
<table>
<thead>
<tr>
<th>Method</th>
<th style="text-align: right;">Mean</th>
<th style="text-align: right;">Error</th>
<th style="text-align: right;">StdDev</th>
<th style="text-align: right;">Allocated</th>
</tr>
</thead>
<tbody>
<tr>
<td>InspectMessageTypesFromNorwayFile1M</td>
<td style="text-align: right;">174.7 ms</td>
<td style="text-align: right;">2.20 ms</td>
<td style="text-align: right;">2.06 ms</td>
<td style="text-align: right;">4 KB</td>
</tr>
<tr>
<td>ReadPositionsFromNorwayFile1M</td>
<td style="text-align: right;">210.5 ms</td>
<td style="text-align: right;">4.15 ms</td>
<td style="text-align: right;">4.08 ms</td>
<td style="text-align: right;">5 KB</td>
</tr>
</tbody>
</table>
<h3 id="net-9.0">.NET 9.0</h3>
<p>These were the numbers for <a href="https://endjin.com/blog/how-dotnet-9-boosted-ais-dotnet-performance-by-9-percent-for-free">.NET 9.0</a>. The upper bound is 6.38 million messages per second, and the more realistic example processes 5.20 million messages per second.</p>
<table>
<thead>
<tr>
<th>Method</th>
<th style="text-align: right;">Mean</th>
<th style="text-align: right;">Error</th>
<th style="text-align: right;">StdDev</th>
<th style="text-align: right;">Allocated</th>
</tr>
</thead>
<tbody>
<tr>
<td>InspectMessageTypesFromNorwayFile1M</td>
<td style="text-align: right;">156.7 ms</td>
<td style="text-align: right;">1.04 ms</td>
<td style="text-align: right;">0.97 ms</td>
<td style="text-align: right;">4.13 KB</td>
</tr>
<tr>
<td>ReadPositionsFromNorwayFile1M</td>
<td style="text-align: right;">192.3 ms</td>
<td style="text-align: right;">1.33 ms</td>
<td style="text-align: right;">1.18 ms</td>
<td style="text-align: right;">4.13 KB</td>
</tr>
</tbody>
</table>
<p>I repeated these tests just now on the very latest .NET 8.0 and .NET 9.0 runtimes to check that my hardware setup hadn't changed in a way that was affecting performance. (We've been benchmarking all the way back to .NET Core 2, but I only repeat the measurements on runtimes still in support.) Within the bounds of experimental noise, the results were essentially the same. (That's what you'd hope, given that this is running on the same hardware, but the .NET runtime does get regular updates, so it's worth checking performance has remained the same on those versions. It's also important to check that I've not done something to my machine to change its performance. In fact, the first time I re-ran these, I got slower numbers. It turned out I hadn't been as thorough as I meant to be when shutting down other processes to get the machine as close to idle as possible. So it was well worth repeating the measurements for older runtimes; otherwise I'd have been making .NET 10.0 look less good than it is.)</p>
<h3 id="net-10.0">.NET 10.0</h3>
<p>And now, the .NET 10.0 numbers:</p>
<table>
<thead>
<tr>
<th>Method</th>
<th style="text-align: right;">Mean</th>
<th style="text-align: right;">Error</th>
<th style="text-align: right;">StdDev</th>
<th style="text-align: right;">Allocated</th>
</tr>
</thead>
<tbody>
<tr>
<td>InspectMessageTypesFromNorwayFile1M</td>
<td style="text-align: right;">148.1 ms</td>
<td style="text-align: right;">0.82 ms</td>
<td style="text-align: right;">0.77 ms</td>
<td style="text-align: right;">4.14 KB</td>
</tr>
<tr>
<td>ReadPositionsFromNorwayFile1M</td>
<td style="text-align: right;">179.3 ms</td>
<td style="text-align: right;">2.45 ms</td>
<td style="text-align: right;">2.30 ms</td>
<td style="text-align: right;">4.14 KB</td>
</tr>
</tbody>
</table>
<p>This shows that on .NET 10.0, our upper bound moves up to 6.75 million messages per second, while the processing rate for the more realistic example goes up to 5.58 million messages per second. Those are improvements of 6% and 7% respectively from .NET 9.0. (I put the 7% figure in the blog title because that benchmark better represents what a real application might do. I've done this in previous years regardless of which of the two benchmarks showed the larger of the two increases.)</p>
<h3 id="surface-laptop-studio-2">Surface Laptop Studio 2</h3>
<p>You might be wondering where the 10.14 million messages per second figure in the opening paragraph came from. That's from running the same benchmark on newer hardware. I use my old desktop to get a consistent view of performance changes over time, but it understates what's possible on current hardware. Here are the numbers from my laptop (a Surface Laptop Studio 2 with a 13th gen Intel Core i7-13800H):</p>
<table>
<thead>
<tr>
<th>Method</th>
<th style="text-align: right;">Mean</th>
<th style="text-align: right;">Error</th>
<th style="text-align: right;">StdDev</th>
<th style="text-align: right;">Allocated</th>
</tr>
</thead>
<tbody>
<tr>
<td>InspectMessageTypesFromNorwayFile1M</td>
<td style="text-align: right;">98.62 ms</td>
<td style="text-align: right;">0.542 ms</td>
<td style="text-align: right;">0.507 ms</td>
<td style="text-align: right;">3.93 KB</td>
</tr>
<tr>
<td>ReadPositionsFromNorwayFile1M</td>
<td style="text-align: right;">114.41 ms</td>
<td style="text-align: right;">0.753 ms</td>
<td style="text-align: right;">0.704 ms</td>
<td style="text-align: right;">3.93 KB</td>
</tr>
</tbody>
</table>
<p>That gives us 10.14 million messages per second for the basic inspection, and 8.74 million messages per second with the more realistic workload.</p>
<h2 id="free-performance-gains-over-time">Free performance gains over time</h2>
<p>The bottom line is that, just as moving your application onto .NET 9.0 may well have given you an instant performance boost with no real effort on your part (as did moving to .NET 8.0 before that, and .NET 7.0 and .NET 6.0 before that), you may enjoy a similar boost when upgrading to .NET 10.0.</p>
<p>We've been running these benchmarks across 7 versions of .NET now (.NET Core 2, .NET Core 3.1, .NET 6.0, .NET 7.0, .NET 8.0, .NET 9.0, and .NET 10.0) enabling us to visualize how performance has improved across these releases for our library. First we'll look at the time taken to process 1 million AIS messages:</p>
<div class="chart-container" style="max-width:800px;margin:1rem auto;">
  <canvas id="chart-processing-time"></canvas>
  <noscript>
    <img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/12/ais-dotnet-net-10-messages-time.png" alt="Bar chart showing processing time in ms to process 1 million AIS messages across .NET versions" class="chart-fallback" srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2025/12/ais-dotnet-net-10-messages-time.png 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2025/12/ais-dotnet-net-10-messages-time.png 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2025/12/ais-dotnet-net-10-messages-time.png 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2025/12/ais-dotnet-net-10-messages-time.png 1600w" sizes="(min-width: 70rem) 62rem, 100vw">
  </noscript>
</div>

<p>(I've gone back to showing the figures for my aging desktop to present a consistent history, which is why this and the next graph don't show us breaking through the 10 million messages per second boundary.)</p>
<p>Each .NET release has brought improvements, so let's look at the version-over-version performance gains:</p>
<div class="chart-container" style="max-width:800px;margin:1rem auto;">
  <canvas id="chart-version-improvement"></canvas>
  <noscript>
    <img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/12/ais-dotnet-net-10-version-improvement.png" alt="Bar chart showing version-over-version performance improvement percentages" class="chart-fallback" srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2025/12/ais-dotnet-net-10-version-improvement.png 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2025/12/ais-dotnet-net-10-version-improvement.png 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2025/12/ais-dotnet-net-10-version-improvement.png 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2025/12/ais-dotnet-net-10-version-improvement.png 1600w" sizes="(min-width: 70rem) 62rem, 100vw">
  </noscript>
</div>

<p>And next, the throughput in AIS messages per second (same benchmarks, just a different way to present the results):</p>
<div class="chart-container" style="max-width:800px;margin:1rem auto;">
  <canvas id="chart-throughput"></canvas>
  <noscript>
    <img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/12/ais-dotnet-net-10-messages-rate.png" alt="Bar chart showing AIS messages processed per second across .NET versions" class="chart-fallback" srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2025/12/ais-dotnet-net-10-messages-rate.png 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2025/12/ais-dotnet-net-10-messages-rate.png 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2025/12/ais-dotnet-net-10-messages-rate.png 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2025/12/ais-dotnet-net-10-messages-rate.png 1600w" sizes="(min-width: 70rem) 62rem, 100vw">
  </noscript>
</div>

<p>Finally, let's look at the cumulative performance improvements since .NET Core 2.0:</p>
<div class="chart-container" style="max-width:800px;margin:1rem auto;">
  <canvas id="chart-cumulative"></canvas>
  <noscript>
    <img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/12/ais-dotnet-net-10-cumulative.png" alt="Line chart showing cumulative performance gains since .NET Core 2.0" class="chart-fallback" srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2025/12/ais-dotnet-net-10-cumulative.png 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2025/12/ais-dotnet-net-10-cumulative.png 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2025/12/ais-dotnet-net-10-cumulative.png 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2025/12/ais-dotnet-net-10-cumulative.png 1600w" sizes="(min-width: 70rem) 62rem, 100vw">
  </noscript>
</div>

<p>Over the time the AIS.NET library has existed, performance has more than doubled, thanks entirely to improvements in the .NET runtime!</p>
<h2 id="learn-more-about-ais.net">Learn more about AIS.NET</h2>
<p>You can learn more about our Ais.Net library at the GitHub repo, <a href="http://github.com/ais-dotnet/Ais.Net/">http://github.com/ais-dotnet/Ais.Net/</a>, and in the same <a href="http://github.com/ais-dotnet/">ais-dotnet GitHub organisation</a> you'll also find some other layers, as illustrated in this diagram:</p>
<p><img src="https://endjincdn.blob.core.windows.net/assets/ais-dotnet-project-layers.png" alt="A diagram showing the Ais.Net library layering as three rows. The top row provides this description of Ais.Net.Receiver: AIS ASCII data stream ingestion using IAsyncEnumerable. Decoded message available via IObservable. The second row provides this description of Ais.Net.Models: Immutable data structures using C# 9.0 Records. Interface expose domain concepts such as position. The third row provides this description of Ais.Net: high performance, zero-allocation decode using Span<T>. ~3 million messages per second per core."></p>
<p>Note that there is a separate repository for <a href="https://github.com/ais-dotnet/Ais.Net.Models"><code>Ais.Net.Models</code></a>. And there's another for the <a href="https://github.com/ais-dotnet/Ais.Net.Receiver/"><code>Ais.Net.Receiver</code></a> project. If you would like to experiment with this library, you will find some <a href="https://code.visualstudio.com/docs/languages/polyglot">polyglot notebooks</a> illustrating its use at <a href="https://github.com/ais-dotnet/Ais.Net.Notebooks">https://github.com/ais-dotnet/Ais.Net.Notebooks</a></p>
<aside class="series-toc" aria-label="Series table of contents">
    <div class="series-toc__header">
        <h3 class="series-toc__title">AIS.NET</h3>
        <span class="series-toc__count">7 parts</span>
    </div>
    <ol class="series-toc__list">
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/introducing-ais-dotnet-high-performance-ais-parsing-in-csharp" class="series-toc__link">
                    <span class="series-toc__part-number">1.</span>
                    <span class="series-toc__part-title">Introducing AIS.NET</span>
                </a>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/arraypool-vs-memorypool-minimizing-allocations-ais-dotnet" class="series-toc__link">
                    <span class="series-toc__part-number">2.</span>
                    <span class="series-toc__part-title">Minimizing Allocations</span>
                </a>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/how-dotnet-6-boosted-ais-dotnet-performance-by-20-percent-for-free" class="series-toc__link">
                    <span class="series-toc__part-number">3.</span>
                    <span class="series-toc__part-title">.NET 6.0's 20% Perf Boost</span>
                </a>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/how-dotnet-7-boosted-ais-dotnet-performance-by-19-percent-for-free" class="series-toc__link">
                    <span class="series-toc__part-number">4.</span>
                    <span class="series-toc__part-title">.NET 7.0's 19% Perf Boost</span>
                </a>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/how-dotnet-8-boosted-ais-dotnet-performance-by-27-percent-for-free" class="series-toc__link">
                    <span class="series-toc__part-number">5.</span>
                    <span class="series-toc__part-title">.NET 8.0's 27% Perf Boost</span>
                </a>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/how-dotnet-9-boosted-ais-dotnet-performance-by-9-percent-for-free" class="series-toc__link">
                    <span class="series-toc__part-number">6.</span>
                    <span class="series-toc__part-title">.NET 9.0's 9% Perf Boost</span>
                </a>
            </li>
            <li class="series-toc__item series-toc__item--current" aria-current="page">
                <span class="series-toc__part-number">7.</span>
                <span class="series-toc__part-title">.NET 10.0's 7% Perf Boost</span>
                <span class="series-toc__current-label">(you are here)</span>
            </li>
    </ol>
</aside>]]></content:encoded>
    </item>
    <item>
      <title>C# 14 New Feature: Field-Backed Properties</title>
      <description>&lt;p&gt;In this talk, &lt;a href="https://endjin.com/who-we-are/our-people/ian-griffiths/"&gt;Ian Griffiths&lt;/a&gt; explains how C# 14's new field-backed properties feature can save you from metaphorically falling off a cliff when you need more flexibility beyond automatic properties' basic functionality.&lt;/p&gt;
&lt;p&gt;He demonstrates the use of this feature to customize property setters without losing the simplicity and support of automatic properties. By allowing you to refer to the compiler-generated field inside get or set methods, C# 14 reduces verbosity and maintains code clarity and organization.&lt;/p&gt;
&lt;p&gt;Learn how this small but impactful enhancement can improve your C# coding experience.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;00:00 Introduction to C# 14's New Feature&lt;/li&gt;
&lt;li&gt;00:30 Understanding Automatic Properties&lt;/li&gt;
&lt;li&gt;01:11 Customizing Property Behavior&lt;/li&gt;
&lt;li&gt;03:06 Introducing C# 14's New Syntax&lt;/li&gt;
&lt;li&gt;04:21 Benefits of the New Feature&lt;/li&gt;
&lt;li&gt;05:33 Conclusion&lt;/li&gt;
&lt;/ul&gt;</description>
      <link>https://endjin.com/what-we-think/talks/csharp-14-new-feature-field-backed-properties</link>
      <guid isPermaLink="true">https://endjin.com/what-we-think/talks/csharp-14-new-feature-field-backed-properties</guid>
      <pubDate>Wed, 03 Dec 2025 06:30:00 GMT</pubDate>
      <category>.NET 10</category>
      <category>C# 14</category>
      <category>dotnet</category>
      <category>Field-Backed Properties</category>
      <category>Automatic Properties</category>
      <category>Open Source</category>
      <category>.NET Foundation</category>
      <category>Talk</category>
      <enclosure length="0" type="image/jpeg" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/talks/csharp-14-new-features-field-backed-properties.jpg" />
      <dc:creator>Ian Griffiths</dc:creator>
      <content:encoded><![CDATA[<p>In this talk, <a href="https://endjin.com/who-we-are/our-people/ian-griffiths/">Ian Griffiths</a> explains how C# 14's new field-backed properties feature can save you from metaphorically falling off a cliff when you need more flexibility beyond automatic properties' basic functionality.</p>
<p>He demonstrates the use of this feature to customize property setters without losing the simplicity and support of automatic properties. By allowing you to refer to the compiler-generated field inside get or set methods, C# 14 reduces verbosity and maintains code clarity and organization.</p>
<p>Learn how this small but impactful enhancement can improve your C# coding experience.</p>
<ul>
<li>00:00 Introduction to C# 14's New Feature</li>
<li>00:30 Understanding Automatic Properties</li>
<li>01:11 Customizing Property Behavior</li>
<li>03:06 Introducing C# 14's New Syntax</li>
<li>04:21 Benefits of the New Feature</li>
<li>05:33 Conclusion</li>
</ul>
<p><a href="https://www.youtube.com/watch?v=aAMQOBZA2bI"><img src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/talks/csharp-14-new-features-field-backed-properties.jpg"></a></p><p>C# 14 can save you from falling off a cliff with its new field-backed properties feature. Admittedly, the cliff is metaphorical. Sometimes when you're using a language or library feature, you can find yourself wanting to go beyond what that feature is able to support. And by moving outside the bounds of its support, you, so to speak, walk off the cliff and that sudden loss of support makes your life a lot harder.</p>
<p>Let me show you what I mean. I've got a very simple class here with a couple of properties. As you may know, this syntax where we use just the get and set keywords and optionally an accessibility modifier makes the compiler generate some code for us. It'll define a hidden field to hold the value and it supplies bodies for the get and set that use that field. The proper name for this is an automatically implemented property, but we typically shorten that to just auto property. This saves us from the tedious business of declaring a field and writing the obvious code to read and write the value in that field. It's not a huge deal, but if you're writing lots of properties, this offers worthwhile improvements in clarity and reduces work.</p>
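<p>The class being described is something like this (the class name is an assumption; the Value and IsModified property names come from the talk):</p>
<pre><code class="language-csharp">public class Widget
{
    // Automatically implemented properties: the compiler generates the backing fields
    // and the trivial get/set bodies.
    public string? Value { get; set; }
    public bool IsModified { get; private set; }
}
</code></pre>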
<p>But what if we want slightly more than what C# generates for us? Notice this type defines an IsModified property. What if I want to set that anytime the Value property changes? Before C# 14, the only way to do that was to write a full property instead of an automatic property. Visual Studio can make that change for me. As you can see, this means declaring the field explicitly and having get and set accessors that use that field. Actually, Visual Studio doesn't quite get it right in this case because it hasn't noticed that the field name collides with the contextual keyword value inside the setter. So I need to qualify the field with a this reference.</p>
<p>And now this is almost identical to what the compiler was generating for me. I can now add in the extra feature that I wanted. So I'm just going to adjust the layout and then use the full block syntax for the setter. And that gives me a place to put the code that sets the IsModified flag. Let's just run that and check that it worked.</p>
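<p>By this point the expanded property looks something like this (class name assumed, as before):</p>
<pre><code class="language-csharp">public class Widget
{
    // Before C# 14 the backing field has to be declared by hand.
    private string? value;

    public string? Value
    {
        get =&gt; this.value;
        set
        {
            // 'this.' distinguishes the field from the setter's implicit 'value' parameter.
            this.value = value;
            this.IsModified = true;
        }
    }

    public bool IsModified { get; private set; }
}
</code></pre>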
<p>And you can see that after I've set the Value property, the IsModified flag reflects that change as required. The obvious downside is that this is more verbose. It's not terrible. I can't complain about the fact that I've had to write the setter explicitly. The goal here was to customize that, but I've also got an explicitly implemented getter, which is effectively identical to what the compiler was generating, and I've also now got this field.</p>
<p>It's only a slight increase in clutter, but perhaps more concerning is the fact that it would be possible for other code in this class to use this field directly, bypassing my change detection. So the cliff wasn't a big one, it's a bit jarring, but this isn't a major problem. However, it comes up often enough that the C# team decided to support scenarios like this without forcing you to stop using automatic properties entirely.</p>
<p>In C# 14, I can leave the automatic get exactly as it is because I didn't actually want to change that. Here I can customize just the setter. I can add this extra feature setting the IsModified flag. But how do I modify the value? Well, this is where I use the new syntax. Inside a get or set accessor, I can use the field keyword to refer to the compiler-generated field.</p>
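<p>In code, that looks something like this:</p>
<pre><code class="language-csharp">public class Widget
{
    public string? Value
    {
        get;                // still compiler-generated
        set
        {
            field = value;  // 'field' is the new keyword referring to the compiler-generated backing field
            IsModified = true;
        }
    }

    public bool IsModified { get; private set; }
}
</code></pre>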
<p>Let me just change the startup project and running that again, we get the required behavior. Comparing this to what we had to do before, you can see that this is a relatively small change. Before, I needed to declare my own field if I wanted to customize my property behavior, but now I can still get the compiler to generate that field for me.</p>
<p>Before, I had to write a custom getter, even though it was only the setter that I wanted to change. Now I can continue to use the compiler-generated getter. So this new feature has a fairly small impact, but what I prefer about the new code is that it removes clutter. I can see immediately that the getter does nothing out of the ordinary, so it's easy to see the one thing that makes this property slightly unusual.</p>
<p>The other benefit is I've not had to introduce a field. And while that's just one line of code, it's a line that doesn't sit all that well with some widely used .NET style guidelines that require fields to be declared in a separate part of the code from properties. Often the field and property could end up being quite distant, and when conceptually closely related code gets scattered across a file, it increases the work required to understand the code.</p>
<p>Arguably that's a flaw in the coding style guidelines, but for better or worse, it's a very common style in .NET that does provide benefits in some scenarios. Also, by using the compiler-generated field, I can be sure that the only code that modifies the field directly is this line here, and with the old approach, I'd need to search for other uses of the field to understand whether any code elsewhere in the class might be bypassing this change detection.</p>
<p>So in conclusion, C# 14 enables us to continue to enjoy the benefits of automatic properties, even when we move beyond their basic capabilities. So automatic properties have always given us a concise way to get basic property behavior, but now if we want to extend beyond that behavior, we can do so without having to fall off the cliff of support.</p>
<p>My name's Ian Griffiths. Thanks for listening.</p>]]></content:encoded>
    </item>
    <item>
      <title>C# 14 New Feature: Script Directives</title>
      <description>&lt;p&gt;C# 14 introduces new directives that transform C# into a true scripting language. In this video, Ian Griffiths explains how the .NET 10 SDK now lets you run a single C# source file directly—no project file required.&lt;/p&gt;
&lt;p&gt;What you'll learn:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How to run C# files directly with &lt;code&gt;dotnet run&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;The shebang (&lt;code&gt;#!&lt;/code&gt;) directive for Unix/Linux script execution&lt;/li&gt;
&lt;li&gt;The new &lt;code&gt;#:&lt;/code&gt; directive for adding NuGet packages and build properties&lt;/li&gt;
&lt;li&gt;How these features build on the low-ceremony entry points introduced in C# 9&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Contents:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;00:00 Introduction to C# 14 Scripting Capabilities&lt;/li&gt;
&lt;li&gt;00:10 Running C# Source Files with .NET 10 SDK&lt;/li&gt;
&lt;li&gt;00:32 Simplifying C# Program Structure&lt;/li&gt;
&lt;li&gt;01:47 Shebang Syntax for Unix Systems&lt;/li&gt;
&lt;li&gt;04:05 Ignored Directives in C# 14&lt;/li&gt;
&lt;li&gt;04:44 Using External Libraries in C# Scripts&lt;/li&gt;
&lt;li&gt;07:38 Conclusion&lt;/li&gt;
&lt;/ul&gt;</description>
      <link>https://endjin.com/what-we-think/talks/csharp-14-new-feature-script-directives</link>
      <guid isPermaLink="true">https://endjin.com/what-we-think/talks/csharp-14-new-feature-script-directives</guid>
      <pubDate>Fri, 28 Nov 2025 06:30:00 GMT</pubDate>
      <category>.NET 10</category>
      <category>C# 14</category>
      <category>dotnet</category>
      <category>Script Directives</category>
      <category>Open Source</category>
      <category>.NET Foundation</category>
      <category>Talk</category>
      <enclosure length="0" type="image/jpeg" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/talks/csharp-14-script-directives.jpg" />
      <dc:creator>Ian Griffiths</dc:creator>
      <content:encoded><![CDATA[<p>C# 14 introduces new directives that transform C# into a true scripting language. In this video, Ian Griffiths explains how the .NET 10 SDK now lets you run a single C# source file directly—no project file required.</p>
<p>What you'll learn:</p>
<ul>
<li>How to run C# files directly with <code>dotnet run</code></li>
<li>The shebang (<code>#!</code>) directive for Unix/Linux script execution</li>
<li>The new <code>#:</code> directive for adding NuGet packages and build properties</li>
<li>How these features build on the low-ceremony entry points introduced in C# 9</li>
</ul>
<p>Contents:</p>
<ul>
<li>00:00 Introduction to C# 14 Scripting Capabilities</li>
<li>00:10 Running C# Source Files with .NET 10 SDK</li>
<li>00:32 Simplifying C# Program Structure</li>
<li>01:47 Shebang Syntax for Unix Systems</li>
<li>04:05 Ignored Directives in C# 14</li>
<li>04:44 Using External Libraries in C# Scripts</li>
<li>07:38 Conclusion</li>
</ul>
<p><a href="https://www.youtube.com/watch?v=WinOQSYxda8"><img src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/talks/csharp-14-script-directives.jpg"></a></p><p>C# 14 has added new directives that enable C# to be used as a scripting language, starting with the .NET 10 SDK. The .NET run command now supports being run against a single C# source file, so you no longer need a project file. In this folder, I have just a single C# source file. There's no solution file and no .csproj, just this code. I can supply the file name to the .NET run command, which compiles and runs it. This is a natural progression from the C# 9 feature that enabled us to write the program entry point in this simple way. So whereas once we used to have to declare a containing method and type, now we can omit all of that and write just the statements that run when the program starts, and the compiler fills in the rest.</p>
<p>The basic principle was that we shouldn't have to write a load of boilerplate if the compiler can generate suitable defaults. And now in .NET 10, that same idea extends to the project file. If that was only going to contain the defaults, why not get rid of it completely? So now, for the first time, a complete and runnable C# program can live in a single source file.</p>
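<p>As a minimal sketch (the file name is an assumption):</p>
<pre><code class="language-csharp">// hello.cs - a complete program; with the .NET 10 SDK, no .csproj or .sln is needed
Console.WriteLine("Hello from a single C# file!");
</code></pre>
<p>Running it is then just <code>dotnet run hello.cs</code>.</p>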
<p>So although C# continues to be a compiled language, these changes enable us to use it much more like a normal scripting language: the low-ceremony entry points added back in C# 9, combined with the new .NET 10 SDK support for running a C# source file without a project file, mean that an entire C# program can now live in a single runnable source file with no supporting files required. However, for this to be useful, we need some more things. On Unix operating systems, the convention is that a script file begins with what's sometimes known as the shebang syntax. The idea here is that the very first line of the script begins with a hash character followed by an exclamation mark, and the operating system detects that well-known starting sequence. It goes, "Ah, you're using a shebang. Right, the rest of that line is going to tell me which program I need to launch to execute your script successfully." You see this in most shell scripts. You also see it when people are using Python files as scripts, and we need to do the same thing in C#.</p>
<p>I am running in a container here so that I can use Linux, which recognizes the hashbang convention. This particular container has the .NET SDK installed, and you can see this hello.cs file has execute permission set. In the file, we can see that it begins with a shebang line, and that is enough for me to be able to run this.</p>
<p>When I execute this file, the operating system sees that it begins with the hash and exclamation mark, so it knows that it has to run the command that follows, passing in the file path as an argument. So the effect is that this single C# source file becomes a runnable program. Now you do need the .NET SDK to be installed for this to work because the .NET run command has to compile the file to be able to run it, so this won't work if you've installed only the .NET runtime. And this is mostly an SDK feature. It is the .NET run command that's doing all the work here. The C# compiler just ignores that hashbang directive, and if you go and find the language specification for this new feature, its title is "ignored directives." So as far as the compiler's concerned, these things aren't really any different from a comment, but whereas a comment is designed to be read by other developers, the audience for this directive is the operating system. So it's there because Unix-style operating systems recognize this as a standard start to a script file, and they know what to do with it.</p>
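<p>As a sketch, such a script looks something like this (the interpreter path assumes <code>dotnet</code> is installed at <code>/usr/bin/dotnet</code>, which can vary between machines):</p>
<pre><code class="language-cs">#!/usr/bin/dotnet run
// Everything below the shebang is ordinary C#; the compiler treats the first line as an ignored directive.
Console.WriteLine("Hello from an executable C# script!");
</code></pre>
<p>After a <code>chmod +x hello.cs</code>, the file can be launched directly as <code>./hello.cs</code>.</p>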
<p>Now, C# 14 also defines another kind of ignored directive. You can also put lines that begin with a <code>#:</code>. Now again, the compiler just does nothing with these—it treats them in much the same way as it does the <code>#!</code> syntax. But these are intended for a different audience. So rather than being directed at the operating system, the <code>#:</code> lines are there for the benefit of the .NET run command. And these enable you to control aspects of how the source file is built that might otherwise have required you to add a project file.</p>
<p>For example, suppose we want to write a C# script that uses external libraries. I've got a script here that discovers the latest available version of a particular NuGet package. Specifically, it's looking at the latest version of System.Reactive, the main component of the Reactive Extensions for .NET, which my employer Endjin currently maintains, by the way. So to do this, my script uses the NuGet client SDK. Now that is not built into the .NET runtime. The NuGet client SDK is itself distributed via NuGet.</p>
<p>So to use it in my C# code, normally I'd expect to add a package reference in my .csproj file, but this is a script and the whole point is that it's self-contained. There is no .csproj file, so I need some way to tell the .NET run command that my code needs to use a NuGet package. And you can see that the second line of this script is a <code>#:</code> directive, and the dotnet run command searches for these in the file, and it expects the colon to be followed by text indicating what we're asking it to do. For this line, the word "package" tells the tool that this code uses a particular NuGet package, specifically the NuGet package that provides the NuGet client SDK that this code uses. Notice there's a second <code>#:</code> directive, this time with the text "property." This lets me set build properties. It turns out that the NuGet client SDK relies on reflection-based JSON serialization, a feature that's disabled by default for scripts, which would cause this program to fail with an exception.</p>
<p>But this directive tells the dotnet run command that when it builds our script into executable code, it should act as though I had a .csproj file with a property group setting this property to true. That enables the feature that the NuGet client library requires, and so the script just works.</p>
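<p>To make that concrete, here's a sketch of what the top of such a script can look like. The package reference and the build property are illustrative placeholders rather than the exact ones used in the video, and the <code>name=value</code> form of the <code>#:property</code> directive reflects my reading of the .NET 10 documentation, so check it against the SDK you have installed:</p>
<pre><code class="language-cs">#!/usr/bin/dotnet run
#:package NuGet.Protocol@6.*
#:property JsonSerializerIsReflectionEnabledByDefault=true

// The #: lines are ignored by the compiler; dotnet run reads them and behaves as though an
// equivalent PackageReference and property existed in a .csproj file.
Console.WriteLine("Ready to use the NuGet client SDK here...");
</code></pre>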
<p>So from a C# perspective, this is a really simple feature. It's just two new directive types, both of which the compiler completely ignores. This works and it's useful because the .NET SDK is now able to build and run source files directly. So the two directives: the first one is for the operating system's benefit—it enables us to put a <code>#!</code> mark on the first line of code, and Unix-like operating systems recognize that and it enables them to go and find the dotnet run command in order to execute the script. And then the second directive type enables us to provide instructions to the .NET run tool that might otherwise have required us to add a project file.</p>
<p>My name's Ian Griffiths. Thanks for listening.</p>]]></content:encoded>
    </item>
    <item>
      <title>Adventures in Least Privilege: When an owner isn't an owner</title>
      <description>A troubleshooting journey through Microsoft Entra ID that reveals the subtle but critical distinction between App Registration ownership and Service Principal ownership - and why it matters for least-privilege automation.</description>
      <link>https://endjin.com/blog/adventures-in-least-privilege-when-entra-owner-is-not</link>
      <guid isPermaLink="true">https://endjin.com/blog/adventures-in-least-privilege-when-entra-owner-is-not</guid>
      <pubDate>Thu, 27 Nov 2025 06:30:00 GMT</pubDate>
      <category>Automation</category>
      <category>CI/CD</category>
      <category>DevOps</category>
      <category>Entra ID</category>
      <category>Infrastructure as Code</category>
      <category>Least Privilege</category>
      <category>Security</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/11/adventures-in-least-privilege-when-entra-owner-is-not.png" />
      <dc:creator>James Dawson</dc:creator>
      <content:encoded><![CDATA[<p>There are often situations where we need to automate the configuration of Microsoft Entra ID resources as part of our infrastructure deployments. When doing so with a least-privilege mindset, we carefully select the minimum permissions required for our automation to succeed. Or at least that's the theory. As I discovered recently, the relationship between App Registrations and Service Principals has a subtle gotcha that can break even the most carefully planned least-privilege automation.</p>
<p>Let me walk you through a troubleshooting journey that taught me more about some of Entra ID's behind-the-scenes activities than I bargained for.</p>
<h2 id="the-setup">The Setup</h2>
<p>Picture this: you're deploying a <a href="https://endjin.com/what-we-think/talks/a-brief-introduction-to-streamlit-development">Streamlit</a> application on Azure Container Apps with Entra ID-based authentication and authorisation. The entire deployment is automated using Infrastructure-as-Code, including the Entra ID configuration itself. To keep things secure, you're using a user-assigned managed identity with carefully selected Microsoft Graph permissions:</p>
<ul>
<li><strong>Application.ReadWrite.OwnedBy</strong> - Create apps (&amp; service principals) and manage those we create</li>
<li><strong>Directory.Read.All</strong> - Read directory information for lookups</li>
<li><strong>Group.Create</strong> - Create security groups; also enables the option of setting ourselves as owners (but only at creation time!)</li>
</ul>
<p>These permissions are utilised through PowerShell ARM Deployment Scripts, and everything seems perfectly scoped. The <code>Application.ReadWrite.OwnedBy</code> permission is particularly elegant as it allows our managed identity to create App Registrations and automatically become their owner, without requiring the overly broad <code>Application.ReadWrite.All</code> permission.</p>
<p>The deployment script creates the App Registration, defines the necessary app roles, and configures the authentication settings. So far, so good.</p>
<div class="aside"><p><em><strong>NOTE</strong>: In case you're wondering why we're not using the <a href="https://learn.microsoft.com/en-us/community/content/microsoft-graph-bicep-extension">Bicep Graph Extension</a>, there are some subtleties in its current behaviour that are incompatible with a least-privilege mindset; for example, this <a href="https://github.com/microsoftgraph/msgraph-bicep-types/issues/233">GitHub issue</a>.</em></p>
</div>
<h2 id="the-problem">The Problem</h2>
<p>Then comes the moment to assign groups to the app roles exposed by our newly created app registration, enabling user authorisation within the Streamlit app. This is where things went wrong.</p>
<pre><code>Error: Insufficient privileges to complete the operation
</code></pre>
<p>Wait, what? We own the App Registration. We can see our managed identity listed as an owner in the portal. We should have full control over this resource. Time to investigate.</p>
<h2 id="down-the-missing-permission-rabbit-hole">Down the 'Missing Permission' Rabbit Hole</h2>
<p>All good troubleshooting sessions must send you down a rabbit hole at some point!</p>
<p>The first instinct is to check whether we need an additional Microsoft Graph permission. After all, we're trying to perform an app role assignment operation. Perhaps we need <code>AppRoleAssignment.ReadWrite.All</code>?</p>
<p>But here's the problem: <code>AppRoleAssignment.ReadWrite.All</code> is what Microsoft classifies as a "Tier 0" permission; one of the highest privilege levels in the entire tenant. According to the <a href="https://learn.microsoft.com/en-us/graph/permissions-reference">Microsoft Graph permissions reference</a>, this permission allows an application to:</p>
<div class="aside"><p>Grant additional privileges to itself, other applications, or any user</p>
</div>
<p>That's essentially the keys to the kingdom. An application with this permission can escalate to ANY permission in the tenant by assigning itself app roles on Microsoft Graph or other resources. This is precisely the kind of broad privilege we're trying to avoid.</p>
<p>Surely there must be a least-privilege variant, similar to how <code>Application.ReadWrite.OwnedBy</code> exists for application management?</p>
<p>There is no <code>AppRoleAssignment.ReadWrite.OwnedBy</code> permission (as we'll see, it's not necessary). Microsoft Graph API permissions are tenant-wide and cannot be scoped down to specific applications. This is a deliberate security design decision - the ability to assign arbitrary app roles is simply too powerful to scope safely.</p>
<p>For a security-conscious deployment following least-privilege principles, granting <code>AppRoleAssignment.ReadWrite.All</code> just to assign roles to our own application is unacceptable. There must be another way.</p>
<h2 id="the-first-clue">The First Clue</h2>
<p>While checking the App Registration in the portal, I notice something. Under Enterprise Applications, there's a Service Principal for our application. But curiously, our deployment script never creates a Service Principal, so where did it come from?</p>
<p>Let's check the owners of this Service Principal. And there it is: our managed identity is NOT listed as an owner. Despite owning the App Registration, we don't own the corresponding Service Principal.</p>
<p>This is the smoking gun.</p>
<h2 id="applications-service-principals-two-objects-two-ownership-models">Applications &amp; Service Principals: Two Objects, Two Ownership Models</h2>
<p>Let's step back and understand what's really happening here. Microsoft Entra ID has a concept that's easy to overlook if you primarily work through the Azure Portal: <strong>Application objects and Service Principal objects are separate entities with independent ownership</strong>.</p>
<p>As explained in Microsoft's <a href="https://learn.microsoft.com/en-us/entra/identity-platform/app-objects-and-service-principals">App objects and service principals documentation</a>:</p>
<p><strong>Application Object (App Registration)</strong>:</p>
<ul>
<li>The globally unique definition of your application</li>
<li>Lives in your "home" tenant</li>
<li>Acts as a template or blueprint</li>
</ul>
<p><strong>Service Principal Object</strong>:</p>
<ul>
<li>The local representation of an application in a specific tenant</li>
<li>A concrete instance derived from the application object</li>
<li>Required for the application to actually authenticate users or access resources as itself</li>
</ul>
<p>One Application object can have many Service Principals (one per tenant where it's used in multi-tenant scenarios), but in single-tenant applications, there's typically one Service Principal in your home tenant.</p>
<h3 id="the-portals-helpful-deception">The Portal's Helpful Deception</h3>
<p>When you register an application through the Azure Portal, both the Application object AND its Service Principal are created simultaneously, and you automatically become owner of both. This convenience feature hides the fact that these are separate operations.</p>
<h3 id="the-apis-honest-truth">The API's Honest Truth</h3>
<p>When using the <a href="https://learn.microsoft.com/en-us/graph/overview">Microsoft Graph API</a> (which is what our ARM Deployment Script does under the hood), creating an Application object and creating a Service Principal are <strong>separate, explicit operations</strong>.</p>
<p>When you create an App Registration using the Graph API with <code>Application.ReadWrite.OwnedBy</code> permission, you automatically become owner of the Application object. However, the Service Principal must be created separately. If you don't explicitly create it, certain operations (like first consent or first app role assignment) will automatically create one for you.</p>
<p><strong>And here's the gotcha</strong>: when a Service Principal is automatically created, it does NOT inherit ownership from the Application object or the user that triggered the implicit creation.</p>
<div class="aside"><p>This is documented behaviour in the <a href="https://learn.microsoft.com/en-us/graph/api/serviceprincipal-post-serviceprincipals">API documentation</a>, though it's easy to miss the implications if you're not specifically looking for them.</p>
<p>In our case, the web app hadn't yet been accessed, so it was the 'first app role assignment' that triggered it; but it's interesting to learn that there are two scenarios for this implicit creation.</p>
</div>
<h3 id="why-this-design">Why This Design?</h3>
<p>This isn't a bug, it's a deliberate security boundary. Service Principals represent actual runtime identities in your tenant. They're what authenticate, what hold credentials, and what get assigned permissions; therefore ownership of them has significant security implications.</p>
<p>The principle here is <strong>explicit intent</strong>. The API requires you to explicitly state "I want a Service Principal for this Application" to prevent accidental sprawl of identities and to ensure proper ownership assignment.</p>
<h2 id="the-solution-embrace-explicit-creation">The Solution: Embrace Explicit Creation</h2>
<p>Once we understand the problem, the solution becomes clear. Our ARM Deployment Script needs to explicitly create the Service Principal, not rely on automatic creation.</p>
<p>The approach is straightforward:</p>
<ol>
<li>Create the App Registration (we automatically become owner)</li>
<li>Explicitly create the related Service Principal (we automatically become owner)</li>
<li>Now we can perform app role assignments because we own both objects</li>
</ol>
<p>The beauty of this approach is that it works perfectly with <code>Application.ReadWrite.OwnedBy</code>. According to the <a href="https://learn.microsoft.com/en-us/graph/api/serviceprincipal-post-serviceprincipals#permissions">permissions documentation</a>, creating a Service Principal requires either:</p>
<ul>
<li><code>Application.ReadWrite.OwnedBy</code> (least privileged)</li>
<li><code>Application.ReadWrite.All</code></li>
</ul>
<p>We already have the least-privilege permission we need.</p>
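<p>The deployment in question uses PowerShell, but to make the shape of the fix concrete, here's a rough sketch of the same create-then-own sequence using the Microsoft Graph .NET SDK. The display name and the IDs are placeholders, and your own automation will differ in the details:</p>
<pre><code class="language-cs">using System;
using Azure.Identity;
using Microsoft.Graph;
using Microsoft.Graph.Models;

// Authenticate as the user-assigned managed identity holding Application.ReadWrite.OwnedBy.
var graph = new GraphServiceClient(
    new ManagedIdentityCredential(clientId: "&lt;managed-identity-client-id&gt;"));

// 1. Create the App Registration - the caller automatically becomes owner of the Application object.
var app = await graph.Applications.PostAsync(new Application
{
    DisplayName = "streamlit-demo-app"
});

// 2. Explicitly create the Service Principal - the caller becomes owner of this object too,
//    rather than leaving Entra ID to create an unowned one implicitly later on.
var sp = await graph.ServicePrincipals.PostAsync(new ServicePrincipal
{
    AppId = app!.AppId
});

// 3. App role assignments now succeed, because we own both objects.
await graph.ServicePrincipals[sp!.Id].AppRoleAssignedTo.PostAsync(new AppRoleAssignment
{
    PrincipalId = Guid.Parse("&lt;security-group-object-id&gt;"),
    ResourceId = Guid.Parse(sp.Id!),
    AppRoleId = Guid.Parse("&lt;app-role-id&gt;")
});
</code></pre>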
<h3 id="cleaning-up-the-orphaned-service-principal">Cleaning Up the Orphaned Service Principal</h3>
<p>Before implementing the fix, we need to either delete the automatically created Service Principal that our managed identity doesn't own, or manually add the managed identity as an owner. Since we can't manage the Service Principal with our current permissions (we don't own it), this requires one of the following:</p>
<ul>
<li>Using a Global Administrator account</li>
<li>Using the <code>Application.ReadWrite.All</code> permission temporarily</li>
<li>Using an account that has the appropriate Entra ID role (Application Administrator or higher)</li>
</ul>
<p>Once cleaned up, update our deployment script to explicitly create the Service Principal immediately after creating the App Registration. When we run the updated script, both objects will be created with the managed identity as owner, and app role assignments will work as expected.</p>
<h2 id="lessons-learned-principles-for-least-privilege-entra-id-automation">Lessons Learned: Principles for Least-Privilege Entra ID Automation</h2>
<p>This troubleshooting journey reinforced several important principles for anyone automating Entra ID configuration.</p>
<h3 id="understand-the-dual-object-model">1. Understand the Dual-Object Model</h3>
<p>App Registrations and Service Principals are distinct objects. The portal's convenience features can mask this reality. When working with APIs or automation, treat them as separate entities that both require explicit management.</p>
<p><strong>Ref</strong>: <a href="https://learn.microsoft.com/en-us/entra/identity-platform/app-objects-and-service-principals">App objects and service principals</a></p>
<h3 id="application.readwrite.ownedby-is-powerful-but-has-boundaries">2. Application.ReadWrite.OwnedBy is Powerful but Has Boundaries</h3>
<p>The <code>Application.ReadWrite.OwnedBy</code> permission is excellent for least-privilege automation. It allows you to:</p>
<ul>
<li>Create applications (you automatically become owner)</li>
<li>Create service principals (you automatically become owner)</li>
<li>Fully manage resources you own</li>
<li>List all applications and service principals in the tenant</li>
</ul>
<p>However, it does NOT allow you to:</p>
<ul>
<li>Manage applications or service principals you don't own</li>
<li>Be an owner of Service Principals created by automatic processes</li>
</ul>
<p><strong>Ref</strong>: <a href="https://learn.microsoft.com/en-us/graph/permissions-reference">Microsoft Graph permissions reference</a></p>
<h3 id="explicit-is-better-than-implicit">3. Explicit is Better than Implicit</h3>
<p>Following Python's zen, explicit is better than implicit. In infrastructure automation, this is doubly true. Be aware of when automatic object creation can happen, so you can avoid relying on it (unwittingly or otherwise) and instead create those objects explicitly in your deployment scripts. This ensures:</p>
<ul>
<li>Predictable ownership assignment</li>
<li>Clear audit trails of what was created when</li>
<li>Proper permissions from the start</li>
<li>No orphaned resources</li>
</ul>
<h3 id="approleassignment.readwrite.all-is-too-privileged">4. AppRoleAssignment.ReadWrite.All is Too Privileged</h3>
<p>While it might seem like an easy solution to app role assignment problems, <code>AppRoleAssignment.ReadWrite.All</code> is classified as a Tier 0 privilege. It can be used to escalate to any permission in the tenant. Reserve this permission only for scenarios where it's absolutely necessary, and never use it when a more scoped approach exists.</p>
<h3 id="the-portal-and-api-have-different-behaviours">5. The Portal and API Have Different Behaviours</h3>
<p>The Azure Portal provides convenience features that streamline common workflows. This is excellent for interactive use, but automation requires understanding the underlying API behaviour. What happens automatically in the portal often requires explicit steps in API-based automation.</p>
<h3 id="consider-owners-in-your-automation">6. Consider Owners in Your Automation</h3>
<p>When creating Applications &amp; Service Principals, you can add additional owners using the <a href="https://learn.microsoft.com/en-us/graph/api/application-post-owners">Add application owner</a> &amp; <a href="https://learn.microsoft.com/en-us/graph/api/serviceprincipal-post-owners">Add service principal owner</a> APIs respectively. This requires <code>Application.ReadWrite.OwnedBy</code>, although <code>Directory.Read.All</code> is often useful if you need to look up identities by name (as the APIs require the 'ObjectId', also known as the 'PrincipalId').</p>
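<p>Continuing the earlier Graph SDK sketch (with <code>graph</code>, <code>app</code> and <code>sp</code> as before, and a placeholder owner ID), adding an extra owner is a reference-style POST against each object:</p>
<pre><code class="language-cs">// The owner to add, identified by its directory ObjectId
// (use Directory.Read.All to look this up by name if necessary).
var ownerRef = new ReferenceCreate
{
    OdataId = "https://graph.microsoft.com/v1.0/directoryObjects/&lt;owner-object-id&gt;"
};

// The Application object and the Service Principal hold their owner lists independently,
// so add the owner to both.
await graph.Applications[app.Id].Owners.Ref.PostAsync(ownerRef);
await graph.ServicePrincipals[sp.Id].Owners.Ref.PostAsync(ownerRef);
</code></pre>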
<p>Consider whether your automation needs multiple owners for lifecycle management.</p>
<h2 id="closing-thoughts">Closing Thoughts</h2>
<p>The beauty of Infrastructure-as-Code and automation is that once you understand the correct pattern, you can codify it and never encounter this problem again. The pain of troubleshooting becomes the foundation of better practices.</p>
<p>Microsoft's security model for Entra ID is sophisticated and well-thought-out. The separation between Application objects and Service Principal objects provides important security boundaries. The challenge is that this sophistication isn't always obvious, especially when the Azure Portal's convenience features smooth over the rough edges.</p>
<p>For those of us building automated deployments with least-privilege principles, understanding these nuances isn't optional — it's essential. The difference between <code>Application.ReadWrite.OwnedBy</code> and <code>Application.ReadWrite.All</code> might seem subtle, but it represents the difference between scoped, secure automation and unnecessarily broad permissions.</p>
<p>I hope this troubleshooting journey saves you the hours I spent tracking down this particular gotcha.</p>
<p>Have you encountered similar gotchas in Azure or Entra ID automation? Leave a comment below, or ping me via Bluesky <a href="https://bsky.app/profile/jdawson.bsky.social">@jdawson.bsky.social</a>.</p>]]></content:encoded>
    </item>
    <item>
      <title>Ix.NET v7.0: .NET 10 and LINQ for IAsyncEnumerable&lt;T&gt;</title>
      <description>Ix.NET 7.0.0 is now available. Because .NET 10.0 now includes LINQ for IAsyncEnumerable, Ix.NET's System.Linq.Async has had to step back. This post explains what has changed and why.</description>
      <link>https://endjin.com/blog/ix-v7-dotnet-10-linq-iasyncenumerable</link>
      <guid isPermaLink="true">https://endjin.com/blog/ix-v7-dotnet-10-linq-iasyncenumerable</guid>
      <pubDate>Wed, 26 Nov 2025 05:30:00 GMT</pubDate>
      <category>Ix</category>
      <category>Ix.NET</category>
      <category>.NET 10</category>
      <category>IAsyncEnumerable&lt;T&gt;</category>
      <category>Rx</category>
      <category>Rx.NET</category>
      <category>Reactive Extensions</category>
      <category>Reactive</category>
      <category>System.Reactive</category>
      <category>C#</category>
      <category>CSharp</category>
      <category>dotnet</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/11/ix-v7-dotnet-10-linq-iasyncenumerable.png" />
      <dc:creator>Ian Griffiths</dc:creator>
      <content:encoded><![CDATA[<p>We've just released a new version of the <a href="https://github.com/dotnet/reactive">Interactive Extensions for .NET (Ix.NET)</a>. Now that .NET 10.0 offers built-in support for LINQ to <code>IAsyncEnumerable&lt;T&gt;</code>, it's time for Ix.NET's <a href="https://www.nuget.org/packages/System.Linq.Async"><code>System.Linq.Async</code></a> to step back.</p>
<h2 id="why-you-might-need-to-upgrade">Why you might need to upgrade</h2>
<p>If you've been seeing errors of this kind since .NET 10 shipped:</p>
<p><code>error CS0121: The call is ambiguous between the following methods or properties: 'System.Linq.AsyncEnumerable.Select&lt;TSource, TResult&gt;(System.Collections.Generic.IAsyncEnumerable&lt;TSource&gt;, System.Func&lt;TSource, TResult&gt;)' and 'System.Linq.AsyncEnumerable.Select&lt;TSource, TResult&gt;(System.Collections.Generic.IAsyncEnumerable&lt;TSource&gt;, System.Func&lt;TSource, TResult&gt;)'</code></p>
<p>you may need to upgrade to v7 of Ix.NET's <a href="https://www.nuget.org/packages/System.Linq.Async/7.0.0"><code>System.Linq.Async</code></a> package. (In the long run you will want to stop using it entirely, and use the .NET runtime library <a href="https://www.nuget.org/packages/System.Linq.AsyncEnumerable"><code>System.Linq.AsyncEnumerable</code></a> package instead, but if you've ended up seeing these errors because of an indirect dependency, you might not be able to remove the reference just yet, in which case you'll need to upgrade it instead.)</p>
<p>In most cases, that will solve the problem. It's possible you'll also need to add a reference to <code>System.Interactive.Async</code> v7 (or upgrade an existing reference to that version). There are also some more complex scenarios. This post explains what has changed and why.</p>
<h2 id="net-10.0-and-linq-to-iasyncenumerablet">.NET 10.0 and LINQ to <code>IAsyncEnumerable&lt;T&gt;</code></h2>
<p>The main reason for this new Ix.NET release is that .NET 10.0 now implements a feature that used to be part of Ix.NET: LINQ for <code>IAsyncEnumerable&lt;T&gt;</code>.</p>
<p>For years, if you wanted to use the standard LINQ operators with <code>IAsyncEnumerable&lt;T&gt;</code>, you had to use the <code>System.Linq.Async</code> library. Despite its name, this was not part of the .NET runtime class libraries, and was not maintained by Microsoft. That library was originally produced by the Rx.NET team as part of the Ix.NET libraries. This occurred after Microsoft stopped work on Rx.NET, so <code>System.Linq.Async</code> was always a community-maintained library. (See the <a href="https://endjin.com/blog/ix-v7-dotnet-10-linq-iasyncenumerable#history">History</a> section for the reasons behind this.) But .NET 10.0 now supplies the functionality that <code>System.Linq.Async</code> was originally written to provide.</p>
<p>The .NET team did consider just taking over ownership of <code>System.Linq.Async</code>, but decided instead to reimplement LINQ for <code>IAsyncEnumerable&lt;T&gt;</code> from scratch. This was partly motivated by the fact that the old <code>System.Linq.Async</code> package predates some library design guidelines, and made some naming choices that do not align with current practice.</p>
<p>This new implementation lives in an assembly called <code>System.Linq.AsyncEnumerable</code>, which is built into .NET 10. It is also available for use on older versions of .NET (including .NET Framework) through the <a href="https://www.nuget.org/packages/System.Linq.AsyncEnumerable/"><code>System.Linq.AsyncEnumerable</code></a> NuGet package. This provides a complete implementation of LINQ for <code>IAsyncEnumerable&lt;T&gt;</code>.</p>
<h2 id="what-this-means-for-developers">What this means for developers</h2>
<p>Anyone writing new code that targets .NET 10 can use the standard LINQ operators on any <code>IAsyncEnumerable&lt;T&gt;</code> without needing to add any NuGet packages. But where it gets a little more tricky is if either:</p>
<ul>
<li>you were already using the old <code>System.Linq.Async</code> and have upgraded to .NET 10.0</li>
<li>you end up with a transitive dependency on the old <code>System.Linq.Async</code></li>
</ul>
<p>That second one will be quite common because <code>System.Linq.Async</code> is a widely used package. You might be using it without ever having asked for it.</p>
<p>(Note that it's also possible to hit problems without even upgrading to .NET 10.0. The problems that occur in these two scenarios aren't really to do with being on the .NET 10.0 runtime: they happen because <code>System.Linq.Async</code> v6 clashes with <code>System.Linq.AsyncEnumerable</code>. .NET 10.0 includes <code>System.Linq.AsyncEnumerable</code> 'in the box', so an upgrade to .NET 10.0 is likely to be the most common reason for encountering this clash. But it can also happen on .NET Framework, .NET 8.0 or .NET 9.0, because the new <code>System.Linq.AsyncEnumerable</code> package is also available via NuGet on those older platforms.)</p>
<p>The problem here is that you can end up with two implementations of LINQ for <code>IAsyncEnumerable&lt;T&gt;</code>: the old <code>System.Linq.Async</code> (Ix.NET) and the new <code>System.Linq.AsyncEnumerable</code> (.NET runtime libraries). When two implementations of every standard LINQ operator are available for <code>IAsyncEnumerable&lt;T&gt;</code>, the compiler emits <code>error CS0121: The call is ambiguous</code> messages any time you try to use them.</p>
<p>In most cases the fix is simple: upgrade to <code>System.Linq.Async</code> v7. (If you were using it directly, this just means upgrading your existing package reference to the latest version. If you have ended up with an indirect reference through some other package you'll need to add a new reference to the latest version of <code>System.Linq.Async</code>.) In some cases, for reasons described later, you might need to add a reference (or upgrade an existing reference) to <code>System.Interactive.Async</code> v7.</p>
<h2 id="system.linq.async-will-be-deprecated"><code>System.Linq.Async</code> will be deprecated</h2>
<p>V7 of <code>System.Linq.Async</code> provides a quick fix for the compilation errors that developers may encounter in .NET 10, but the longer term solution is for everyone to stop using <code>System.Linq.Async</code>. Its only purpose was to provide LINQ for <code>IAsyncEnumerable&lt;T&gt;</code>, and now that the .NET runtime libraries supply this through <code>System.Linq.AsyncEnumerable</code>, there is no longer any reason for Ix's <code>System.Linq.Async</code> to exist.</p>
<p>So Ix's <code>System.Linq.Async</code> is now a legacy component that exists purely for backwards compatibility reasons. If you're writing an application that has ended up depending on <code>System.Linq.Async</code> you might not be able to get rid of that dependency—you'll have to wait until the authors of the libraries that depend on it stop using it. But if you have only a direct dependency on <code>System.Linq.Async</code>, you should stop using it, and should switch to <code>System.Linq.AsyncEnumerable</code> instead. (You may also need to add a reference to <code>System.Interactive.Async</code> for reasons described later.)</p>
<p>We will be deprecating the <code>System.Linq.Async</code> package to encourage people to move off it.</p>
<p>However, if you do this, you may discover that the .NET runtime's <code>System.Linq.AsyncEnumerable</code> is not an <em>exact</em> replacement. There are two issues:</p>
<ul>
<li>Some methods have been renamed because naming conventions changed since Ix.NET first provided <code>IAsyncEnumerable&lt;T&gt;</code> LINQ support</li>
<li>The new <code>System.Linq.AsyncEnumerable</code> has omitted some of the functionality that Ix supplied</li>
</ul>
<p>An example of the first kind of issue occurs with operators that take callbacks, such as <code>Where</code>. When filtering an <code>IAsyncEnumerable&lt;T&gt;</code>, you might want to use a normal <code>Func&lt;T, bool&gt;</code> just like you would with <code>IEnumerable&lt;T&gt;</code>, but since you're in an asynchronous world, you might actually want to provide an <code>async</code> callback, and you might want that to support cancellation. To support that, Ix's <code>System.Linq.Async</code> offered not only <code>Where</code>, but also <code>WhereAwait</code> and <code>WhereAwaitWithCancellation</code>.</p>
<p>This same functionality exists in .NET's <code>System.Linq.AsyncEnumerable</code>, but there are two important changes:</p>
<ul>
<li>the async callback overloads have the same name as the normal ones (e.g., instead of <code>WhereAwait</code> and <code>WhereAwaitWithCancellation</code>, we now have just overloads of <code>Where</code>)</li>
<li>the async callback overloads <em>require</em> the callback to accept a <code>CancellationToken</code> (which the callback is free to ignore)</li>
</ul>
<p>If you're using these async callback operator forms today and you upgrade to v7 of <code>System.Linq.Async</code>, you will see warnings of this kind:</p>
<p><code>warning CS0618: 'AsyncEnumerable.WhereAwait&lt;TSource&gt;(IAsyncEnumerable&lt;TSource&gt;, Func&lt;TSource, ValueTask&lt;bool&gt;&gt;)' is obsolete: 'Use Where. IAsyncEnumerable LINQ is now in System.Linq.AsyncEnumerable, and the WhereAwait functionality now exists as overloads of Where. You will need to modify your callback to take an additional CancellationToken argument.'</code></p>
<p>We continue to provide these old methods, but we provide a deprecation warning to encourage you to move onto the new equivalents provided by the .NET runtime libraries.</p>
<p><strong>Note</strong>: if you're using any of the old methods with <code>Await</code> in their name, you will need to do more than just using the new method name, because the new .NET runtime library implementations <em>require</em> your callback to take a <code>CancellationToken</code>. If you don't add this extra parameter to your callbacks, you will get errors of this form:</p>
<p><code>error CS4010: Cannot convert async lambda expression to delegate type 'Func&lt;int, bool&gt;'. An async lambda expression may return void, Task or Task&lt;T&gt;, none of which are convertible to 'Func&lt;int, bool&gt;'.</code></p>
<p>This is not a very helpful message because it doesn't explain what you need to do. (Note that the deprecation message <em>does</em> tell you what you need to do, but if you didn't read that all the way to the end, you will have missed the part that saves you from this error.) If you had this:</p>
<pre><code class="language-cs">IAsyncEnumerable&lt;int&gt; evens = GenerateNumbersAsync(10)
    .WhereAwait(async x =&gt; x % 2 == 0);
</code></pre>
<p>and after reading about two-thirds of the deprecation warning you changed it to this:</p>
<pre><code class="language-cs">IAsyncEnumerable&lt;int&gt; evens = GenerateNumbersAsync(10)
    .Where(async x =&gt; x % 2 == 0); // CS4010 error on this line
</code></pre>
<p>you'll get that error above. You need to modify the lambda to accept an additional argument:</p>
<pre><code class="language-cs">IAsyncEnumerable&lt;int&gt; evens = GenerateNumbersAsync(10)
    .Where(async (x, _) =&gt; x % 2 == 0);
</code></pre>
<p>The change here is that instead of the single <code>x</code> parameter, we now have a parameter list: <code>(x, _)</code>. That underscore indicates that we don't actually want to use our second argument. (It's a <a href="https://endjin.com/blog/csharp-lambda-discards">discard</a>.) That's OK, but we still have to accept the argument, because the new <code>System.Linq.AsyncEnumerable</code> (.NET runtime) implementation of <code>Where</code> does not support the single-argument callbacks that <code>System.Linq.Async</code> (Ix) did. This is not a change in fundamental capability: you're not obliged to do anything with that cancellation token. But it does mean you need to change more than just the name of the method you're calling.</p>
<p>There's a further trap for <code>Select</code>. If you had code like this:</p>
<pre><code class="language-cs">xs.Select(async v =&gt; ...)
</code></pre>
<p>it's not enough to do this:</p>
<pre><code class="language-cs">xs.Select(async (v, _) =&gt; ...)
</code></pre>
<p>because the compiler can't tell whether you mean the overload that accepts a <code>Func&lt;TElement, CancellationToken, ValueTask&lt;Result&gt;&gt;</code>, or the overload that takes a callback which receives an extra parameter indicating the index of the value, which is of the form <code>Func&lt;TElement, int, TResult&gt;</code>. (The basic issue here is that in its standard form, <code>Select</code> can accept either 1- or 2-argument projection callbacks. This means the additional cancellable forms can create ambiguity.) So you need to make it clear which overload you mean by specifying the argument types. For example if <code>xs</code> is an <code>IAsyncEnumerable&lt;int&gt;</code> you can write this:</p>
<pre><code class="language-cs">xs.Select(async (int v, CancellationToken _) =&gt; ...)
</code></pre>
<p>Note that this ambiguity has nothing to do with the transition from Ix's <code>System.Linq.Async</code> to .NET's <code>System.Linq.AsyncEnumerable</code>. You can run into exactly this error if you create a brand new .NET 10 project and did not use Ix at all. (I don't know the history, but it's possible that the Rx team chose to use the unusual <code>Await</code> and <code>WithCancellation</code> naming conventions to avoid exactly this kind of ambiguity.)</p>
<p>Now you might have done all of this, and still find that if you attempt to remove your reference to the old <code>System.Linq.Async</code> NuGet package, your code no longer compiles. In which case, read on...</p>
<h2 id="relocated-functionality">Relocated functionality</h2>
<p>The new <code>System.Linq.AsyncEnumerable</code> (.NET runtime) library does not provide all of the functionality that <code>System.Linq.Async</code> (Ix) library did. This table describes the relevant extension methods for <code>IAsyncEnumerable&lt;T&gt;</code>:</p>
<table>
<thead>
<tr>
<th>Operator</th>
<th>Feature</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>AsAsyncEnumerable</code></td>
<td>Similar to <a href="https://learn.microsoft.com/en-us/dotnet/api/system.linq.enumerable.asenumerable?view=net-9.0"><code>Enumerable.AsEnumerable</code></a>—ensures only <code>IAsyncEnumerable&lt;T&gt;</code>-typed operations are available when a type might have other extension methods available e.g. due to implementing multiple interfaces</td>
</tr>
<tr>
<td><code>AverageAsync</code></td>
<td>Projection-based overloads (e.g., <code>xs.AverageAsync(p =&gt; p.Age)</code>)</td>
</tr>
<tr>
<td><code>SumAsync</code></td>
<td>Projection-based overloads (e.g., <code>xs.SumAsync(p =&gt; p.Mass)</code>)</td>
</tr>
<tr>
<td><code>ToObservable</code></td>
<td>Adapts an <code>IAsyncEnumerable&lt;T&gt;</code> to an Rx <code>IObservable&lt;T&gt;</code></td>
</tr>
</tbody>
</table>
<p>Since our goal is for people to stop using <code>System.Linq.Async</code>, we can't just leave these methods in there. So we have moved them into <code>System.Interactive.Async</code>.</p>
<div class="aside"><p>Historically, the split between <code>System.Linq.Async</code> and <code>System.Interactive.Async</code> was that the former contained 'standard' LINQ operators found on all LINQ providers, and the latter is the home for operators invented by the Rx team. Since the .NET Runtime team has decided that the operators listed above aren't standard, evidently they belong in <code>System.Interactive.Async</code>.</p>
</div>
<p><code>System.Linq.Async</code> v7 includes a transitive reference to <code>System.Interactive.Async</code>, so it should 'just work'. But we are deprecating <code>System.Linq.Async</code>, and when a project stops using that, it might be necessary to add in a reference to <code>System.Interactive.Async</code> so that these non-standard methods or overloads remain available.</p>
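<p>For example, once a project references <code>System.Interactive.Async</code> v7 directly, code along these lines (a sketch, with a made-up <code>Person</code> type) should keep compiling after the <code>System.Linq.Async</code> reference is removed:</p>
<pre><code class="language-cs">using System;
using System.Collections.Generic;
using System.Linq;

IAsyncEnumerable&lt;Person&gt; people =
    new[] { new Person("Ada", 36), new Person("Alan", 41) }.ToAsyncEnumerable();

// Projection-based aggregate overloads now come from System.Interactive.Async...
double averageAge = await people.AverageAsync(p =&gt; p.Age);
Console.WriteLine(averageAge);

// ...as does the adapter onto Rx's IObservable&lt;T&gt;.
IObservable&lt;Person&gt; asObservable = people.ToObservable();

record Person(string Name, int Age);
</code></pre>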
<p>Note that <code>System.Linq.Async</code> also defined a <code>ToEnumerable</code> method that adapts any <code>IAsyncEnumerable&lt;T&gt;</code> to <code>IEnumerable&lt;T&gt;</code>. This is a 'sync over async' operation, and those are usually a bad idea, and we believe it was a mistake for <code>System.Linq.Async</code> ever to have offered this. We have elected not to provide a new version of that. If you need this, we suggest you rethink your design. And if after that you really think you still need it, well, Ix.NET is open source, so you can always find the original implementation, but our view is that you will be better off not using it.</p>
<p>There are also some extension methods from other types to <code>IAsyncEnumerable&lt;T&gt;</code> that <code>System.Linq.AsyncEnumerable</code> did not duplicate:</p>
<table>
<thead>
<tr>
<th>Target</th>
<th>Method</th>
<th>Feature</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>IObservable&lt;T&gt;</code></td>
<td><code>ToAsyncEnumerable</code></td>
<td>Adapts an Rx source to an <code>IAsyncEnumerable&lt;T&gt;</code></td>
</tr>
<tr>
<td><code>Task&lt;T&gt;</code></td>
<td><code>ToAsyncEnumerable</code></td>
<td>Adapts a <code>Task&lt;T&gt;</code> to an <code>IAsyncEnumerable&lt;T&gt;</code></td>
</tr>
</tbody>
</table>
<p>Again, these are now available in <code>System.Interactive.Async</code>.</p>
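<p>Here's a quick sketch of those two adapters in use, assuming references to <code>System.Interactive.Async</code> v7 and (for the observable source) <code>System.Reactive</code>:</p>
<pre><code class="language-cs">using System;
using System.Collections.Generic;
using System.Linq;
using System.Reactive.Linq;
using System.Threading.Tasks;

// An Rx source consumed as a pull-based async sequence.
IAsyncEnumerable&lt;int&gt; fromObservable = Observable.Range(1, 3).ToAsyncEnumerable();

// A Task&lt;T&gt; exposed as a single-element async sequence.
IAsyncEnumerable&lt;int&gt; fromTask = Task.FromResult(42).ToAsyncEnumerable();

await foreach (int value in fromObservable.Concat(fromTask))
{
    Console.WriteLine(value); // 1, 2, 3, 42
}
</code></pre>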
<p>There is one non-extension static method formerly defined by <code>AsyncEnumerable</code>:</p>
<table>
<thead>
<tr>
<th>Method</th>
<th>Feature</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>Create</code></td>
<td>Callback based sequence creation</td>
</tr>
</tbody>
</table>
<p>We've moved this into <code>System.Interactive.Async</code>, and unfortunately this is a case where we've had to make a breaking change where we can't help developers out with an <code>Obsolete</code> message. It is necessary for the <code>System.Linq.Async</code> package's reference assemblies <strong>not</strong> to define a public <code>AsyncEnumerable</code> type, because if they did, it would cause compiler errors in code that attempted to use static members directly. For example, if you wrote <code>AsyncEnumerable.Range(1, 10)</code>, then although this method is now available in .NET 10, if <code>System.Linq.Async</code> defined its own <code>AsyncEnumerable</code> it would cause this error:</p>
<p><code>error CS0433: The type 'AsyncEnumerable' exists in both 'System.Linq.Async, Version=7.0.0.0, Culture=neutral, PublicKeyToken=94bc3704cddfc263' and 'System.Linq.AsyncEnumerable, Version=10.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a'</code></p>
<p>If all the methods on <code>AsyncEnumerable</code> were extension methods this would matter less. (It would still be a potential problem because you are allowed to invoke extension methods using normal static method syntax.) C# doesn't care if two types with identical names both define extension methods: as long as the individual methods don't clash, there's no ambiguity problem. But if you ever refer to the defining class by name (i.e., you refer to the <code>AsyncEnumerable</code> class) then the existence of two definitions becomes a problem and you get that error.</p>
<p>So it is not possible for <code>System.Linq.Async</code>'s public API to include an <code>AsyncEnumerable</code> type. For the most part this isn't a problem: for extension methods we can move them to a different type. (We call this <code>AsyncEnumerableDeprecated</code>, because the only reason <code>System.Linq.Async</code>'s public API retains any of the methods that it used to define on <code>AsyncEnumerable</code> is to be able to provide <code>[Obsolete]</code> attributes telling you what to use instead, and for extension methods, those are equally effective even if we change the defining class name.) And most of the non-extension static methods we used to define are now available on the new <code>System.Linq.AsyncEnumerable</code> library's <code>AsyncEnumerable</code>.</p>
<p>But this one method, <code>Create</code>, is an unfortunate exception. There's really nothing we can do other than remove it from the public face of <code>System.Linq.Async</code>, and define its replacement in <code>AsyncEnumerableEx</code> in <code>System.Interactive.Async</code>. We can retain binary compatibility because the runtime assemblies in <code>System.Linq.Async</code> continue to define <code>AsyncEnumerable</code> exactly as before. But code that was calling <code>AsyncEnumerable.Create</code> before will now just get an error reporting that this method does not exist, and the developer will have to guess that they now need to use <code>AsyncEnumerableEx.Create</code>. Our hope is that because this method hasn't been very useful since C# added support for <code>IAsyncEnumerable&lt;T&gt;</code> iterator methods (<code>yield return</code> etc.), not many people will be using it. The built-in language support does the same thing only better.</p>
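<p>For anyone who was using it, the language-based alternative looks something like this (a sketch; the body is purely illustrative):</p>
<pre><code class="language-cs">using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

await foreach (int value in CountSlowlyAsync(3))
{
    Console.WriteLine(value);
}

// An async iterator method: the compiler generates the IAsyncEnumerable&lt;int&gt; implementation,
// which is what callback-based Create was typically used to hand-roll.
static async IAsyncEnumerable&lt;int&gt; CountSlowlyAsync(
    int count,
    [EnumeratorCancellation] CancellationToken cancellationToken = default)
{
    for (int i = 0; i &lt; count; i++)
    {
        await Task.Delay(100, cancellationToken);
        yield return i;
    }
}
</code></pre>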
<p>The methods described so far in this section are ones that the .NET Runtime team did not consider to be 'standard' operators. Interestingly, there are some that Ix.NET didn't consider to be 'standard' that the .NET Runtime team <em>did</em>. The following methods were defined in <code>System.Interactive.Async</code> v6 because at the time they didn't align with standard operators (or at least, standard overloads) available on other LINQ implementations:</p>
<table>
<thead>
<tr>
<th>Operator</th>
<th>Feature</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>Distinct</code></td>
<td>Projection-based overload now available in <code>System.Linq.AsyncEnumerable</code> as <code>DistinctBy</code></td>
</tr>
<tr>
<td><code>MaxAsync</code></td>
<td>Non-projecting overload, previously considered <em>non-standard</em>, is now in <code>System.Linq.AsyncEnumerable</code></td>
</tr>
<tr>
<td><code>MaxByAsync</code></td>
<td>Non-standard max-with-ties feature that Ix.NET is renaming as <code>MaxByWithTiesAsync</code>; <code>System.Linq.AsyncEnumerable</code> defines an operator with this name that has a different return type and different behaviour</td>
</tr>
<tr>
<td><code>MinAsync</code></td>
<td>Non-projecting overload, previously considered <em>non-standard</em>, is now in <code>System.Linq.AsyncEnumerable</code></td>
</tr>
<tr>
<td><code>MinByAsync</code></td>
<td>Non-standard min-with-ties feature that Ix.NET is renaming as <code>MinByWithTiesAsync</code>; <code>System.Linq.AsyncEnumerable</code> defines an operator with this name that has a different return type and different behaviour</td>
</tr>
</tbody>
</table>
<p>The <code>MaxByAsync</code> and <code>MinByAsync</code> members are problematic because back when Ix.NET introduced these (and their non-async counterparts, <code>MaxBy</code> and <code>MinBy</code>, in <code>System.Interactive</code>), not only were they non-standard operators, they worked differently from the <code>MaxBy</code> and <code>MinBy</code> eventually introduced in .NET 6.0. Back in Ix.NET 6.0, <code>System.Interactive</code> had to hide the <code>IEnumerable&lt;T&gt;</code> versions of these methods and introduce new <code>MaxByWithTies</code> and <code>MinByWithTies</code> operators to continue to make the functionality available. Unfortunately, <code>System.Interactive.Async</code> was not updated in the same way at that time, which is a pity because the old methods could have been retained with <code>Obsolete</code> attributes, giving people time to move onto the new names. Since that didn't happen, we have no choice but to get out of the way of the new <code>MinByAsync</code> and <code>MaxByAsync</code> that .NET supplies, without being able to warn you. Anyone with code that was using the existing methods of these names in <code>System.Interactive.Async</code> is going to get confusing error messages when they upgrade to .NET 10.0, and we can't easily supply guidance. (In theory we could write an analyzer to detect this, but it's complex, and remember we have no budget at all for this work. It's all eating into time we'd rather be spending working on Rx.NET.)</p>
<p>So if you have code like this:</p>
<pre><code class="language-cs">IList&lt;Person&gt; oldest = await people.MaxByAsync(p =&gt; p.Age);
IList&lt;Person&gt; youngest = await people.MinByAsync(p =&gt; p.Age);
</code></pre>
<p>that compiles on .NET 8.0, you'll find that it fails to compile on .NET 10.0 (or if you stay on .NET 8.0 but add a reference to <code>System.Linq.AsyncEnumerable</code>):</p>
<p><code>error CS0029: Cannot implicitly convert type 'Person' to 'System.Collections.Generic.IList&lt;Person&gt;'</code></p>
<p>Note that if you prefer to use <code>var</code>, you won't get an error with this code when upgrading:</p>
<pre><code class="language-cs">var oldest = await people.MaxByAsync(p =&gt; p.Age);
var youngest = await people.MinByAsync(p =&gt; p.Age);
</code></pre>
<p>but the library upgrade will change the type of these two variables. So instead of the compiler error occurring on the line where things went wrong, you'll likely get an error later on in the code when you attempt to use the relevant variable. (Or worse, you won't get a compiler error, but the meaning of your code will subtly change without you intending it to. But since this changes the variables' types from <code>IList&lt;Person&gt;</code> to <code>Person</code>, you will most likely get an error later on in the code.)</p>
<p>You'll need to change it to the following:</p>
<pre><code class="language-cs">IList&lt;Person&gt; oldest = await people.MaxByWithTiesAsync(p =&gt; p.Age);
IList&lt;Person&gt; youngest = await people.MinByWithTiesAsync(p =&gt; p.Age);
</code></pre>
<p>(The reason Ix.NET's versions return a list here is that there might be more than one item sharing the winning value. The .NET runtime's <code>MinBy</code> and <code>MaxBy</code> pick one arbitrary winner.)</p>
<h2 id="target-frameworks">Target frameworks</h2>
<p>The <code>System.Linq.Async</code> and <code>System.Interactive.Async</code> packages now have an additional <code>net10.0</code> target framework. They don't technically need it: the changes made necessary by .NET 10's addition of <code>System.Linq.AsyncEnumerable</code> are also required on older targets, because the new <code>System.Linq.AsyncEnumerable</code> package can be used on those older targets too.</p>
<p>So this is purely a cosmetic move, intended to signal that these packages are .NET 10-aware.</p>
<p>Since v7 of Ix.NET is entirely about ensuring we work well with the new <code>System.Linq.AsyncEnumerable</code> package, nothing else has changed, and so the <code>System.Interactive</code> and <code>System.Interactive.Providers</code> packages support the same TFMs as before, with nothing later than <code>net6.0</code>. They still work perfectly well on .NET 10.0.</p>
<p>We will update the TFMs so that no package has a .NET TFM lower than <code>net8.0</code> in a future release. (Most likely we will produce a v8 fairly soon that makes explicit that we no longer support .NET 6 or 7.) We just didn't want to conflate that with the fixes required to coexist with the new <code>System.Linq.AsyncEnumerable</code>.</p>
<h2 id="method-hiding-technical-details">Method hiding: technical details</h2>
<p>I've mentioned a few times that we have hidden some methods in <code>System.Linq.Async</code>, but I've not explained what that means.</p>
<p>We can't just remove methods from <code>System.Linq.Async</code>. If we did that we would break binary compatibility. Suppose you're using a library called <code>UsesOldIx</code> that was built against <code>System.Linq.Async</code> v6, and your application upgrades to v7 of <code>System.Linq.Async</code>. And suppose that the <code>UsesOldIx</code> library has <em>not</em> been updated. Your application will now be supplying <code>UsesOldIx</code> with v7 of <code>System.Linq.Async</code> even though <code>UsesOldIx</code> was built against v6. So <code>UsesOldIx</code> doesn't know anything about the new <code>System.Linq.AsyncEnumerable</code>—it will expect all these LINQ to <code>IAsyncEnumerable&lt;T&gt;</code> methods still to reside in <code>System.Linq.Async</code>.</p>
<p>If we just removed methods completely, scenarios like this would cause the application to crash with a <code>MissingMethodException</code>. So we use a trick: we hide the method at build time, but continue to make it available at runtime.</p>
<p>We do this by supplying separate runtime and reference assemblies in the NuGet package. If you look inside the <code>System.Linq.Async</code> package you'll find that as well as the usual <code>lib</code> folder, there's also a <code>ref</code> folder. When both are present, the .NET build tools tell the compiler to look at the assemblies in the <code>ref</code> folder. This enables us to remove methods from the public API at build time—we omit them from the reference assembly—but to leave them present in the runtime assemblies (in the <code>lib</code> folder) so any code that was already built against an older version will still be able to access the hidden methods.</p>
<p><code>System.Interactive</code> was already using this in v6 to deal with the <code>MinBy</code>/<code>MaxBy</code> issues. As described above, it effectively had to rename <code>MinBy</code> to <code>MinByWithTies</code> and <code>MaxBy</code> to <code>MaxByWithTies</code> because .NET 6 had added new methods with these names that did different things. So these had to be hidden but not entirely removed. This trick is not new—Ix.NET has already been using it for a while, and now we're using it in <code>System.Linq.Async</code> too.</p>
<h2 id="history">History</h2>
<p>You might be wondering why the LINQ implementation for <code>IAsyncEnumerable&lt;T&gt;</code> was a community supported project in the first place.</p>
<p>In fact the code that ultimately ended up in the Ix.NET <code>System.Linq.Async</code> package did originate from Microsoft. More specifically, it was an invention of the Rx.NET team, back when that was part of Microsoft. So the reason <code>System.Linq.Async</code> is a community supported project is that Rx.NET itself became a community-supported project.</p>
<p>(For a highly detailed account of this history and related events, you can request a copy of the 'A Little History of Reaqtor' ebook from <a href="https://reaqtive.net/">https://reaqtive.net/</a>)</p>
<p>So the real question is: why did the Rx.NET team end up implementing LINQ to <code>IAsyncEnumerable&lt;T&gt;</code>? And the answer is that <code>IAsyncEnumerable&lt;T&gt;</code> itself was also originally invented by the Rx.NET team.</p>
<p>The Rx.NET team emerged from a group at Microsoft that formed back when cloud computing was just getting established. That group was asked to investigate what the cloud would mean for ordinary developers, and how software development might need to change to take advantage of cloud-native architectures. One of the main researchers in this group was <a href="https://en.wikipedia.org/wiki/Erik_Meijer_%28computer_scientist%29">Erik Meijer</a>. He had also been a significant contributor to the design of LINQ. He had always envisaged LINQ as being more general than merely enabling database integration in C# and VB.NET, and Rx.NET was one realisation of his vision. When <code>async</code>/<code>await</code> were being developed, it would have been a natural step for him to consider what that might mean for LINQ, and so it is perhaps unsurprising that his Rx.NET team produced the original <code>IAsyncEnumerable&lt;T&gt;</code> interface definition.</p>
<p>If you download <a href="https://www.nuget.org/packages/System.Interactive.Async/3.0.0"><code>System.Interactive.Async</code> v3.0.0</a>, unzip the <code>nupkg</code>, and open up <code>System.Interactive.Async.dll</code> in ILDASM, you'll see that this library defines <code>IAsyncEnumerable&lt;T&gt;</code> and <code>IAsyncEnumerator&lt;T&gt;</code>. In fact, the Rx.NET team first published definitions of these interfaces back before NuGet existed, in 2010!</p>
<p>So <code>IAsyncEnumerable&lt;T&gt;</code> had been around for a decade by the time it was made an integral part of .NET, with the release of .NET Core 3.0. Rx.NET has always been very closely associated with LINQ, and so it also had a LINQ implementation for that whole time.</p>
<p>I don't know the history of why the .NET team chose <em>not</em> to provide LINQ for <code>IAsyncEnumerable&lt;T&gt;</code> back then. The fact that there was no longer an Rx.NET team within Microsoft to advocate for this may not have helped. Perhaps LINQ was out of favour within Microsoft at that time, and they underestimated the demand for this feature. And the (now open source) Rx.NET project already had an implementation; its supporters, including some still working at Microsoft at the time, leapt into action to update that implementation when <code>IAsyncEnumerable&lt;T&gt;</code> moved into the runtime libraries. That made it easy to believe there was no need for the .NET runtime class library team to fill the gap.</p>
<p>There certainly <em>was</em> demand for this feature. The Rx.NET team packaged it in a few different ways over the years, but if we look just at the <code>System.Linq.Async</code> library the Rx.NET maintainers produced back in 2019, that has had a quarter of a <strong>billion</strong> downloads! And it is such an obviously useful feature that many people just assumed that the library was actually part of .NET. The package name certainly contributes to that perception—historically Rx.NET has always used <code>System</code> prefixes for its namespaces, because the Rx.NET team's vision was that Rx should be built right into .NET. (<code>IObservable&lt;T&gt;</code> did indeed make it into the class libraries in .NET Framework 4.0. And the versions of .NET that shipped in Windows Phone did actually include full Rx.NET libraries.)</p>
<p>But Microsoft has supplied no funding for the Rx.NET project for well over a decade, which made it impossible to keep up with people's expectations of what a high-quality implementation of LINQ for <code>IAsyncEnumerable&lt;T&gt;</code> should be. When endjin took over maintenance of Rx.NET back in 2023, the company's owners very generously decided to pay for the time I spend working on it, but our motivation for this was that we believe in Rx.NET and want to keep it thriving. We ended up becoming responsible for the other projects in <a href="https://github.com/dotnet/reactive">https://github.com/dotnet/reactive</a> as a side effect, and we never wanted to be the guardians of LINQ to <code>IAsyncEnumerable&lt;T&gt;</code>.</p>
<p>Fortunately, Microsoft had by this time recognized that .NET developers expect LINQ for <code>IAsyncEnumerable&lt;T&gt;</code> to be available, and fully supported, and they offered to build it into the .NET runtime class libraries. And so, today, .NET 10.0 provides built-in support. We're happy that this is available to the .NET world, and that we can now focus our efforts on Rx.</p>
<h2 id="please-try-it-out">Please try it out</h2>
<p>This new 7.0 release of <a href="https://www.nuget.org/packages/System.Linq.Async"><code>System.Linq.Async</code></a> (and the corresponding <a href="https://www.nuget.org/packages/System.Interactive.Async"><code>System.Interactive.Async</code></a>) is available on NuGet today. If you're using LINQ to <code>IAsyncEnumerable&lt;T&gt;</code>, please try upgrading (even if you're not yet on .NET 10). If you have any problems, please file issues at <a href="https://github.com/dotnet/reactive/issues">https://github.com/dotnet/reactive/issues</a>. Meanwhile, we hope you enjoy this new version of the Interactive Extensions for .NET.</p>
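<p>If you haven't tried LINQ over <code>IAsyncEnumerable&lt;T&gt;</code> before, here's a minimal sketch of what it looks like. The <code>GetNumbersAsync</code> source is a made-up example; the <code>Where</code> and <code>Select</code> operators come from <code>System.Linq.Async</code> (or from the built-in implementation if you're on .NET 10):</p>
<pre><code class="language-csharp">using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Apply ordinary LINQ operators to an async sequence, then consume it with await foreach.
await foreach (int square in GetNumbersAsync()
    .Where(n =&gt; n % 2 == 0)
    .Select(n =&gt; n * n))
{
    Console.WriteLine(square);
}

// A hypothetical async source that yields 1..10, with a small delay between items.
static async IAsyncEnumerable&lt;int&gt; GetNumbersAsync()
{
    for (int i = 1; i &lt;= 10; i++)
    {
        await Task.Delay(10);
        yield return i;
    }
}
</code></pre>
<p>Both the <code>System.Linq.Async</code> package and the .NET 10 built-in operators expose these methods in the <code>System.Linq</code> namespace, so consuming code like this typically looks the same either way.</p>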
<h2 id="more-rx-content">More Rx content</h2>
<p>See my recent <a href="https://youtu.be/y7Ks_bwSHUg?si=bDmVEYyb8e0c156M">.NET conf talk about Rx.NET</a> for more information on this work, and our other Rx.NET activities.</p>]]></content:encoded>
    </item>
    <item>
      <title>The Data Product Canvas: The Theory Behind The Canvas</title>
      <description>The Data Product Canvas fuses the Business Model Canvas with Data Mesh's 'data as a product' principle, combining visual strategic collaboration with product-minded data ownership.</description>
      <link>https://endjin.com/blog/the-data-product-canvas-theory-behind-the-canvas</link>
      <guid isPermaLink="true">https://endjin.com/blog/the-data-product-canvas-theory-behind-the-canvas</guid>
      <pubDate>Wed, 22 Oct 2025 05:30:00 GMT</pubDate>
      <category>Data Product Canvas</category>
      <category>Business Model Generation</category>
      <category>Data Mesh</category>
      <category>Data Product</category>
      <category>Value Proposition</category>
      <category>User Centred Design</category>
      <category>Pivot</category>
      <category>Fail Fast</category>
      <category>Purpose Driven Design</category>
      <category>Data</category>
      <category>Analytics</category>
      <category>Microsoft Fabric</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/05/the-data-product-canvas-theory-behind-the-canvas.png" />
      <dc:creator>Barry Smart</dc:creator>
      <content:encoded><![CDATA[<p>TLDR; The Data Product Canvas combines two powerful frameworks: the Business Model Canvas, which revolutionized strategic planning through visual collaboration, and Data Mesh, which reimagines data as a product. By understanding these theoretical foundations, you'll gain deeper insight into why the canvas works and how to apply it most effectively in your organization. This final part of our series examines the origins of these frameworks and how they've been synthesized into a practical tool for data product design.</p>
<p>In <a href="https://endjin.com/blog/the-data-product-canvas-stop-building-products-that-fail">Part 1</a> of this series, we introduced the Data Product Canvas as a framework for designing data products that deliver real business value. <a href="https://endjin.com/blog/the-data-product-canvas-deep-dive-into-building-blocks">Part 2</a> explored each of the nine building blocks in detail, and <a href="https://endjin.com/blog/the-data-product-canvas-in-action">Part 3</a> showed the canvas in action through a worked example. Now, in this final instalment, we'll explore the theoretical foundations that make the canvas so effective.</p>
<p>Understanding the origins and principles behind the Data Product Canvas isn't just an academic exercise. By grasping the foundations, you'll be better equipped to:</p>
<ul>
<li>Adapt the canvas to your specific organisational context</li>
<li>Explain its value to stakeholders and secure buy-in</li>
<li>Apply its principles beyond the canvas itself to create a more product-centric data culture</li>
<li>Recognize when and how to modify the approach as your data maturity evolves</li>
</ul>
<p>Let's explore the two major frameworks that inspired the Data Product Canvas:</p>
<ul>
<li><strong>Business Model Canvas</strong> - a visual tool for designing, exploring and iterating on new strategies and product ideas.</li>
<li><strong>Data Mesh</strong> - specifically the concept of "data as a product", which is one of the 4 key principles applied in a Data Mesh architecture.</li>
</ul>
<h2 id="business-model-canvas">Business Model Canvas</h2>
<p>The Business Model Canvas (BMC) was developed by Alex Osterwalder and Yves Pigneur, with Osterwalder completing his PhD dissertation on business model ontology at the University of Lausanne in 2004. The formal publication came with their book <a href="https://www.amazon.co.uk/Business-Model-Generation-Visionaries-Challengers/dp/0470876417/">Business Model Generation</a> in 2010, which marked the canvas's widespread introduction to the business world.</p>
<div class="aside"><p>The Business Model Canvas marked a paradigm shift in business planning. By converting complex strategic thinking into a visual, collaborative format, it democratized strategy development and moved organizations away from static, text-heavy business plans toward dynamic, testable business hypotheses.</p>
</div>
<p>I discovered it on a bookshelf in Waterstones in my home town. It immediately grabbed my attention as it was the first business strategy book I had come across which:</p>
<ul>
<li>Provided a <strong>visual</strong> alternative to traditional, text-heavy business plans.</li>
<li>Seemed to make strategy development more <strong>accessible and collaborative</strong>, reducing the barrier to strategic thinking.</li>
<li>Allowed organisations to quickly imagine, <strong>prototype and iterate</strong> on business models on paper before committing large amounts of resources to an idea, allowing them to shift from traditional annual business planning to more dynamic, iterative approaches.</li>
<li>Encouraged <strong>cross-functional collaboration</strong> in strategy development, in particular bridging the "IT / business" divide.</li>
<li>Facilitated easier <strong>communication</strong> of complex business strategies.</li>
<li>Placed <strong>user-centred design</strong> at the heart of driving the design of new products and services.</li>
</ul>
<p>It simplifies complex strategic thinking into nine fundamental building blocks:</p>
<ol>
<li><strong>Customer Segments</strong> - the specific groups of people or organizations a business aims to serve and create value for.</li>
<li><strong>Value Proposition</strong> - the unique product or service that solves a customer problem or helps them to achieve a goal.</li>
<li><strong>Channels</strong> - the ways a company communicates with and delivers its value proposition to customers.</li>
<li><strong>Customer Relationships</strong> - the types of interactions and connections a business establishes and maintains with its customer segments.</li>
<li><strong>Revenue Streams</strong> - the various methods a company uses to generate income from its customer segments.</li>
<li><strong>Key Resources</strong> - the critical assets and resources required to deliver the value proposition effectively.</li>
<li><strong>Key Activities</strong> - the most important actions a company must perform to deliver its value proposition and operate successfully.</li>
<li><strong>Key Partnerships</strong> - the strategic relationships and networks that support and enhance the business model.</li>
<li><strong>Cost Structure</strong> - the total expenses associated with operating the business model across all its dimensions.</li>
</ol>
<p><a target="_blank" href="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/01/strategyzer-business-model-canvas.png"><img src="https://endjin.com/assets/images/blog/2025/01/strategyzer-business-model-canvas.png" alt="Image of the Business Model Canvas - consisting of 9 building blocks" title="Image of the Business Model Canvas"></a></p>
<p>I first remember putting the Business Model Canvas into use in 2009 at ScottishPower. We used it to explore the future of customer experience within their Energy Networks business. It helped bring together a multi-disciplinary team that included the head of customer services, other members from her department, six sigma black belts and technology professionals. In the space of two hours we used a whiteboard version of the canvas and post-it notes to build situational awareness and explore possible strategies. It allowed us to develop a shared understanding, a common language for describing the challenges / opportunities ahead and clarity of purpose that we could all get behind. An immediate success!</p>
<p>I went on to use it in other roles. In one other notable case, it was used to help stakeholders in a 100-year-old financial services firm understand how other organisations had successfully adopted new "digitally enabled" business models to disrupt their respective industries, diversify and sustain their business over the long term. We met on a monthly basis and used the Business Model Canvas to explore different case studies. It was "the thing that helped to get us to the thing" - in this case "the thing" being a deeper understanding of how technology and data could be a force for good when it came to sustaining the firm for a further 100 years.</p>
<p>I wasn't alone. The tool was adopted by organisations of all shapes and sizes: corporations, non-profits, and government organisations. A global community of practitioners emerged, it was widely adopted in business schools and entrepreneurship programs, and <a href="https://www.strategyzer.com/library/the-business-model-canvas">Strategyzer</a> (Osterwalder's company) became a global force in advising companies on business model innovation.</p>
<p>In 2019, following a "career reboot", I became immersed in the world of data and AI. I thought I would no longer need the Business Model Canvas, until I came across Data Mesh.</p>
<h2 id="data-mesh">Data Mesh</h2>
<p>Data Mesh was <a href="https://martinfowler.com/articles/data-monolith-to-mesh.html">first conceived by Zhamak Dehghani in 2018</a> while working as a principal technology consultant at ThoughtWorks. Dehghani developed the concept as a response to the limitations of traditional centralized data architecture approaches, particularly in large, complex organizations.</p>
<div class="aside"><p>While Data Mesh is often discussed primarily as an architectural approach, its most transformative aspect may be its recognition that data is fundamentally a socio-technical challenge. By treating data as a product with internal customers, it shifts the focus from technical implementation to value delivery.</p>
</div>
<p>Fundamentally, Data Mesh is a socio-technical approach to data management that treats data as a product. It challenges traditional centralized data warehouse models by promoting decentralized domain-oriented ownership, self-serve data infrastructure, and policy as code.</p>
<p><a target="_blank" href="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/01/data-mesh-4-principles.png"><img src="https://endjin.com/assets/images/blog/2025/01/data-mesh-4-principles.png" alt="Image showing the 4 core principles of data mesh: data as a product, domain ownership, federated computational governance and self-serve data platform" title="Image showing the 4 core principles of data mesh"></a></p>
<p>The recognition that "data is a socio-technical endeavour" and the principle of "data as a product" in particular really appealed to me. It is aligned with the way that endjin have been approaching the delivery of data and analytics for the last decade, largely, I believe, because both Zhamak Dehghani and all of us at endjin come from a software engineering background.</p>
<p>It recognizes that people are fundamental to success. Any issues with the people-related aspects will fundamentally undermine the overall success of a data investment by impeding the flow of value, the understanding of end user needs, and the willingness of end users to adopt the solution, put it into use, and generate value from it.</p>
<p>By adopting a product mindset, data teams start to "think like entrepreneurs," adopting a proactive approach to delivering value while also accepting the responsibility for resource-allocation decisions by creating a clear link between data products, business value, and <a href="https://endjin.com/blog/what-is-total-cost-of-ownership-why-is-it-important">total cost of ownership (TCO)</a>, balancing inputs and outputs accordingly.</p>
<p>To succeed, these teams need to develop new tools to help them iterate on ideas to define what a "data product" actually is, and what's required to discover, build, and own a successful data product.</p>
<p>One such tool is a version of the Business Model Canvas that we have adapted for Data Products.</p>
<h2 id="data-product-canvas-business-model-canvas-data-mesh">Data Product Canvas = Business Model Canvas + Data Mesh</h2>
<p>The Data Product Canvas is a version of the Business Model Canvas adapted for iteratively envisaging and evaluating ideas for data products. We've renamed the 9 building blocks to better reflect the specific context of data products:</p>
<ul>
<li>Customer Segments ➡️ <strong>Audience</strong> - the specific groups of people that the data product is aiming to create value for.</li>
<li>Value Proposition ➡️ <strong>Actionable Insight</strong> - the data driven actionable intelligence that will be delivered by the data product to allow the audience to achieve a specific goal.</li>
<li>Channels ➡️ <strong>Consumption</strong> - the means through which the audience will access and use the data product.</li>
<li>Customer Relationships ➡️ <strong>Adoption</strong> - the support that will be given to the audience to enable them to successfully discover and use the data product.</li>
<li>Revenue Streams ➡️ <strong>Lifetime Value</strong> - the value (tangible and intangible) that the data product is aiming to deliver over its lifetime.</li>
<li>Key Resources ➡️ <strong>Data Skills, Tools and Methods</strong> - the key capabilities that will be required to deliver and sustain the data product over its lifetime.</li>
<li>Key Activities ➡️ <strong>Data Processing</strong> - the actions that will need to be performed to transform data sources into the actionable insight.</li>
<li>Key Partnerships ➡️ <strong>Data Sources</strong> - the data sources which are required to deliver the actionable insight.</li>
<li>Cost Structure ➡️ <strong>Total Cost of Ownership</strong> - the projected costs to design, build, test, operate, maintain and evolve the data product over its lifetime.</li>
</ul>
<p><a target="_blank" href="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/01/endjin-data-product-canvas.png"><img src="https://endjin.com/assets/images/blog/2025/01/endjin-data-product-canvas.png" alt="Image of the Data Product Canvas developed by endjin inspired by Business Model Generation and Data Mesh" title="Image of the Data Product Canvas"></a></p>
<h2 id="why-this-combination-works">Why This Combination Works</h2>
<p>The synthesis of these two frameworks creates something greater than the sum of its parts. Here's why this combination is so powerful:</p>
<h3 id="from-the-business-model-canvas">From the Business Model Canvas:</h3>
<ul>
<li><strong>Visual collaboration</strong>: The canvas format facilitates cross-functional discussions and breaks down silos between business and technical stakeholders.</li>
<li><strong>Holistic thinking</strong>: The nine-block structure ensures teams consider all critical aspects of a successful product, not just the technical implementation.</li>
<li><strong>Rapid iteration</strong>: The format encourages teams to quickly explore, validate, and refine ideas before committing significant resources.</li>
<li><strong>Strategic alignment</strong>: By making value explicit, the canvas ensures data initiatives support broader business objectives.</li>
</ul>
<h3 id="from-data-mesh">From Data Mesh:</h3>
<ul>
<li><strong>Product mindset</strong>: Treating data as a product shifts focus from technical implementation to user value and adoption.</li>
<li><strong>Socio-technical approach</strong>: Recognition that successful data initiatives require addressing both human and technical aspects.</li>
<li><strong>Domain orientation</strong>: Focusing on specific business domains rather than trying to solve all data problems at once.</li>
<li><strong>Value accountability</strong>: Making teams responsible for both the costs and benefits of their data products.</li>
</ul>
<div class="aside"><p>The canvas works even if you're not implementing Data Mesh as an architecture. The product-thinking principles it embodies are valuable regardless of your technical approach to data.</p>
</div>
<h2 id="beyond-the-framework-building-a-data-product-culture">Beyond the Framework: Building a Data Product Culture</h2>
<p>While the Data Product Canvas is a powerful tool, its greatest value may be in how it shifts organizational thinking about data. By applying the canvas consistently, organizations begin to develop:</p>
<ol>
<li><strong>Purpose-driven development</strong>: Moving from technology-first to purpose-first thinking</li>
<li><strong>User-centered design</strong>: Deeply understanding who will use data products and how</li>
<li><strong>Value measurement</strong>: Quantifying and tracking the actual business impact of data initiatives</li>
<li><strong>Lifecycle management</strong>: Considering the entire data product lifecycle, from inception to eventual retirement</li>
<li><strong>Sustainable resourcing</strong>: Making explicit decisions about ongoing investment based on realized value</li>
</ol>
<p>This culture shift is often more transformative than any specific data product that emerges from using the canvas.</p>
<h2 id="conclusion">Conclusion</h2>
<p>The Data Product Canvas represents a powerful synthesis of business model thinking and data product principles. By combining the visual collaboration and holistic thinking of the Business Model Canvas with the product mindset and socio-technical approach of Data Mesh, we've created a practical tool that helps organizations overcome the most common causes of data product failure.</p>
<p>Throughout this four-part series, we've:</p>
<ul>
<li>Introduced the canvas and explained why traditional approaches to data products often fail</li>
<li>Explored each building block in detail with practical guidance</li>
<li>Demonstrated the canvas in action through a real-world scenario</li>
<li>Examined the theoretical foundations that make the canvas effective</li>
</ul>
<p>The Data Product Canvas isn't just another framework to follow. It's a catalyst for a fundamentally different approach to data initiatives—one that starts with purpose, focuses relentlessly on user needs, quantifies value, and considers implementation realities. By adopting this approach, organizations can dramatically improve their success rate with data products and build a sustainable competitive advantage through data.</p>
<p>Whether you're just beginning your data journey or looking to improve the impact of an established data practice, the canvas provides a structured, collaborative way to ensure your data investments deliver real business value. We encourage you to download the canvas, adapt it to your needs, and use it to transform how your organization approaches data products.</p>
<p>The journey to becoming truly data-driven isn't about technology—it's about purposefully creating products that help people make better decisions and take more effective actions. The Data Product Canvas is your guide on that journey.</p>
<p>If you would like an Adobe Acrobat version of the canvas, please <a href="https://endjin.com/contact-us/">reach out to us</a>.</p>
<aside class="series-toc" aria-label="Series table of contents">
    <div class="series-toc__header">
        <h3 class="series-toc__title">Data Product Canvas</h3>
        <span class="series-toc__count">4 parts</span>
    </div>
    <ol class="series-toc__list">
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/the-data-product-canvas-stop-building-products-that-fail" class="series-toc__link">
                    <span class="series-toc__part-number">1.</span>
                    <span class="series-toc__part-title">Introducing The Data Product Canvas</span>
                </a>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/the-data-product-canvas-deep-dive-into-building-blocks" class="series-toc__link">
                    <span class="series-toc__part-number">2.</span>
                    <span class="series-toc__part-title">Deep Dive into the Building Blocks</span>
                </a>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/the-data-product-canvas-in-action" class="series-toc__link">
                    <span class="series-toc__part-number">3.</span>
                    <span class="series-toc__part-title">The Canvas in Action</span>
                </a>
            </li>
            <li class="series-toc__item series-toc__item--current" aria-current="page">
                <span class="series-toc__part-number">4.</span>
                <span class="series-toc__part-title">The Theory Behind The Canvas</span>
                <span class="series-toc__current-label">(you are here)</span>
            </li>
    </ol>
</aside>]]></content:encoded>
    </item>
    <item>
      <title>The Data Product Canvas in Action</title>
      <description>See the Data Product Canvas in action with a real-world scenario. Follow along as we work through each building block to design a high-impact, feasible data product for a national garden center chain facing revenue challenges.</description>
      <link>https://endjin.com/blog/the-data-product-canvas-in-action</link>
      <guid isPermaLink="true">https://endjin.com/blog/the-data-product-canvas-in-action</guid>
      <pubDate>Tue, 21 Oct 2025 05:30:00 GMT</pubDate>
      <category>Data Product Canvas</category>
      <category>Business Model Generation</category>
      <category>Data Mesh</category>
      <category>Data Product</category>
      <category>Value Proposition</category>
      <category>User Centred Design</category>
      <category>Pivot</category>
      <category>Fail Fast</category>
      <category>Purpose Driven Design</category>
      <category>Data</category>
      <category>Analytics</category>
      <category>Case Study</category>
      <category>Microsoft Fabric</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/05/the-data-product-canvas-in-action.png" />
      <dc:creator>Barry Smart</dc:creator>
      <content:encoded><![CDATA[<p>TLDR; Put yourself in the shoes of a Head of Data and Analytics as you navigate the challenges of creating a high-impact data product for a struggling garden center chain. Experience first-hand how the Data Product Canvas helps you align business goals with technical capabilities, anticipate adoption challenges, and quantify value — all within one week. Through a detailed narrative, see how each canvas component comes to life, ultimately leading to a data product with potential 10X return on investment that wins enthusiastic board approval.</p>
<p>In <a href="https://endjin.com/blog/the-data-product-canvas-stop-building-products-that-fail">Part 1</a> of this series, we introduced the Data Product Canvas as a framework for designing data products that deliver real business value. In <a href="https://endjin.com/blog/the-data-product-canvas-deep-dive-into-building-blocks">Part 2</a>, we explored each of the nine building blocks in detail. Now, it's time to see the canvas in action through a worked example that demonstrates how it shapes decision-making in a real-world scenario.</p>
<p>Here is a completed canvas for our example:</p>
<p><a target="_blank" href="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/01/endjin-data-product-canvas-worked-example.png"><img src="https://endjin.com/assets/images/blog/2025/01/endjin-data-product-canvas-worked-example.png" alt="Worked example of endjin's data product canvas" title="Worked example of data product canvas"></a></p>
<p>This was created using a Microsoft Visio template. If you are interested in a copy of this, please reach out to us at <a href="mailto:hello@endjin.com">hello@endjin.com</a>.</p>
<h2 id="background-your-first-100-days">Background: Your First 100 Days</h2>
<p>After the boom that came about as a result of the Covid-19 pandemic, when millions of people spending more time at home found a new interest in tending to their gardens, GreenGrow (your garden center chain employer) has struggled to maintain a steady, predictable stream of revenue. At the same time, your company's main competitor is enjoying strong performance, putting pressure on the board.</p>
<p>You've just completed your first 100 days as the new Head of Data and Analytics. Walking into the boardroom for your first major strategy presentation, you feel a mix of nervousness and excitement. This is your moment to make a strong first impression, but also to set realistic expectations about what's possible.</p>
<p>You take a deep breath and begin your presentation. You outline how you want to deliver value rapidly and frequently through targeted data products, each focusing on a specific area of the business where data can deliver high-value impact. You also emphasize something that many previous tech leaders have failed to communicate effectively - that your team can't succeed in isolation.</p>
<p>"Data is a socio-technical endeavour," you explain, seeing a few puzzled looks. "Even the best technical solution will fail if we don't address the cultural and organizational aspects that often become barriers."</p>
<p>You share several case studies of how other organizations have transformed and gained competitive advantage by treating data as an asset. To your relief, the board members lean forward, actively engaged in the conversation. They're forward-thinking and can see the opportunity to take better decisions, more rapidly, and better understand the dynamics of their business by becoming more data-driven.</p>
<p>When you ask the board to prioritize which areas of the business your team should focus on, they're unanimous: address the recent string of poor quarterly financial results where revenue performance has been well below forecasted levels. They want to understand the root causes to stabilize performance and restore investor confidence.</p>
<p>They also mention a secondary concern: declining customer satisfaction. Recent Net Promoter Score (NPS) surveys show a marked decrease in customer happiness. Could this be linked to the revenue issues?</p>
<p>The Chief Financial Officer immediately offers to sponsor this work. She's a tech-savvy CFO who sees numerous opportunities to use a data-driven approach to drive decision making. "Let's work together on this," she says. "I can help you get time with the key stakeholders. Let's come back to the board next week with some data product proposals that could help meet these goals."</p>
<p>You leave the meeting feeling elated - it went even better than you'd hoped! You have full backing to move forward and clear direction on which areas to focus on.</p>
<p>But reality quickly sets in. You only have a week to develop compelling data product ideas for the board to consider. Fortunately, you have two key advantages: the CFO's active support and a tool you've used successfully before - the Data Product Canvas.</p>
<h2 id="your-first-strategy-session">Your First Strategy Session</h2>
<p>In your initial catch-up with the CFO after the board meeting, you start exploring the challenge in more depth. She immediately hones in on one specific issue that has undermined both revenue targets and customer satisfaction: failures across all stores to proactively build up stock of products that are coming into peak demand.</p>
<p>"Have you seen this?" she asks, pulling up a recent press article on her tablet. The headline reads "Garden Centers Leave Customers Disappointed by Empty Shelves." The article highlights customer frustration during Spring when they arrived at stores to find they had sold out of high-demand seasonal items like vegetable seeds, annuals, and bulbs.</p>
<p>"This isn't an isolated incident," she continues. "We've had similar problems at other peak times. It's killing our revenue and driving customers to competitors."</p>
<p>You ask how stock levels are currently determined. The CFO explains that they're set by individual garden center managers based on experience and "gut feel." Given that the company has grown through acquisition of independent garden centers, these managers tend to have their own opinions on how to run their stores.</p>
<p>"I should warn you," she adds, "you may face resistance to new data-driven methods. These managers are used to doing things their way."</p>
<p>You make a mental note of this potential adoption challenge. "That's really helpful to know upfront," you reply. "Would you be willing to help create incentives for store managers to engage with whatever solution we develop?"</p>
<p>She nods. "Absolutely. I can build it into their performance metrics. And we should definitely monitor adoption and outcomes closely."</p>
<p>You both agree there's an opportunity to get something in place within the next three months to catch the upcoming winter season when there's another surge in demand - this time for Christmas trees and decorations.</p>
<p>As you leave the meeting, your mind is already mapping out the canvas approach. This is exactly the kind of challenge where the Data Product Canvas can shine - helping to rapidly define a focused solution with clear business impact.</p>
<h2 id="populating-the-canvas-your-journey-begins">Populating the Canvas: Your Journey Begins</h2>
<p>You remember the recommendation from Part 2 of this blog series: start with a purpose-driven approach. That means beginning with understanding the Audience to validate the Goal and capture the Actionable Insight that will enable that audience to achieve their goal.</p>
<h3 id="starting-with-audience">Starting with Audience</h3>
<p>You set up brief interviews with three different Garden Center Managers. Going into these conversations, you're conscious of potential resistance, so you focus on building rapport first, showing genuine interest in their roles and challenges.</p>
<p>To your relief, they respond positively. You listen carefully as they describe their daily reality - the constant juggling act between dealing with operational firefighting and trying to plan ahead. Their primary concerns revolve around keeping customers happy so they return regularly, and retaining their staff.</p>
<p>"Most days, I barely have time to breathe," one manager tells you with a sigh. "I'm constantly putting out fires - dealing with staffing issues, customer complaints, supplier problems. Finding time to think strategically about what stock we'll need in three months? That's a luxury."</p>
<p>You note these time constraints - any solution you develop will need to be efficient and low-effort for managers to adopt.</p>
<h3 id="identifying-the-goal-and-designing-the-actionable-insight">Identifying the Goal and Designing the Actionable Insight</h3>
<p>When you bring up the board's goal of maximizing revenue and increasing customer satisfaction, the managers nod in agreement. "Absolutely, that's what we're all trying to do," one says.</p>
<p>"What actions are you empowered to take that would help achieve this goal?" you ask.</p>
<p>They immediately highlight stock management and the recent incidents where garden centers ran out of items in high demand.</p>
<p>"The worst feeling is seeing a customer walk out empty-handed because we don't have what they want," one manager explains. "But the flip side is, I'm really nervous about over-ordering. I hate seeing good product going to waste."</p>
<p>Another manager adds, "I just don't have the time to analyze what items might be coming into demand in the next few months. And our ordering system takes forever to use. I'd love some help with knowing what to order and when."</p>
<p>You're beginning to see a clear picture of the problem and potential solution. You settle on the key question that needs answering: "What products should we be stocking to maximize sales next month?"</p>
<p>When you ask how this information would need to be presented to be useful, they're specific: "A simple list of products predicted to be in high demand next month, showing current stock level, target level, and the gap we need to fill."</p>
<p>You validate the action this would enable: proactively ordering stock from suppliers to meet upcoming demand - which directly contributes to the revenue and customer satisfaction goals.</p>
<p>As you wrap up these conversations, you feel a growing sense of excitement. There's a clear opportunity here, with a direct line between data, insight, and business impact.</p>
<h3 id="consumption-and-adoption-making-it-work-in-the-real-world">Consumption and Adoption: Making It Work in the Real World</h3>
<p>Having outlined the actionable insight, your focus turns to what will be required for garden center managers to successfully adopt and use it.</p>
<p>Initially, you explore the possibility of automated integration into the corporate ERP platform, but after discussing the technical realities with your team and IT, you agree that in the short term, it's best to have a human in the loop. Based on this, you decide that an online report, filtered dynamically to an individual garden center's stock levels, is the best presentation method.</p>
<p>Thinking about adoption, you identify several important considerations:</p>
<p>During your interviews, you discovered the garden center managers are generally not familiar with Power BI, your preferred reporting tool. "I'll need to create a video walkthrough to help them get comfortable with the new report," you think to yourself.</p>
<p>You also learned that at least one garden center manager is colour blind (which isn't surprising, given that around 1 in 20 people have a colour vision deficiency). You make a note to design the report with accessibility in mind.</p>
<p>Given the engagement level you've seen from the managers, you see an opportunity to build community around the new data product. "A monthly town hall session would be perfect for gathering feedback and ideas about future enhancements," you think. "That could really help with adoption."</p>
<p>When exploring consumption in more depth, several requirements become clear:</p>
<p>The output needs to be a printable table of products with recommended stock levels and ordering levels, so managers can use it during stock checks and order entry. You make a note that a paginated report will be required.</p>
<p>You confirm with the CFO that the report should refresh on a monthly cycle, with email notifications to users. She also specifies that, initially, garden center managers should only see data for their own store, meaning you'll need to implement row-level security.</p>
<h3 id="lifetime-value-quantifying-the-impact">Lifetime Value: Quantifying the Impact</h3>
<p>Working with the CFO, you create a view of the lifetime value of the data product over five years. As you quantify the potential impact on revenue, reduction in write-offs (waste), and customer satisfaction, you both become increasingly excited.</p>
<p>"If we can reduce stockouts by even 25%, the revenue impact would be substantial," the CFO calculates. "And the reduction in waste from more precise ordering could save us significant costs too."</p>
<p>The numbers are compelling. Looking at the analysis, you feel a surge of confidence - this idea could be transformative for the business.</p>
<h3 id="data-sources-and-processing-exploring-feasibility">Data Sources and Processing: Exploring Feasibility</h3>
<p>Now it's time to assess technical feasibility. You bring together your team and partners from IT to identify data sources and processing requirements.</p>
<p>To your relief, most of the data you need exists in the corporate ERP platform. However, as is often the case, there are secondary data sources maintained in Excel spreadsheets containing important reference data.</p>
<p>Your team assesses the quality of these data sources, identifying some issues that will need addressing. "This is manageable," your data engineer confirms. "We'll need to build in some quality checks, but the core data looks solid."</p>
<h3 id="data-skills-tools-and-methods-identifying-capability-gaps">Data Skills, Tools and Methods: Identifying Capability Gaps</h3>
<p>When evaluating the capabilities needed, you identify a gap around machine learning model development. While your team is confident about the data engineering and report building aspects, they express concern about the ML component.</p>
<p>"We can handle the ETL and reporting," your senior analyst says, "but the demand prediction model is outside our comfort zone. We've not done anything quite like this before."</p>
<p>You consider the options and decide the best approach in the short term is to bring in external expertise from a consultancy you've worked with previously. This will add cost, but it will also help de-risk the project and provide valuable knowledge transfer to your team.</p>
<p>You're also acutely aware that ML projects are experimental by nature. There's no guarantee you'll be able to develop a model that meets the minimum acceptance criteria. You make a point of stressing this with the CFO during your next meeting.</p>
<p>"I appreciate your honesty," she responds. "What would you say is a reasonable budget to test if this is feasible before we commit to full implementation?"</p>
<p>You discuss an appropriate "learning budget," and she agrees to your proposed figure. "If it works, the ROI will be tremendous. If not, we'll have learned something valuable without breaking the bank."</p>
<h3 id="total-cost-of-ownership-tco-making-the-financial-case">Total Cost of Ownership (TCO): Making the Financial Case</h3>
<p>Finally, you work with colleagues in IT and finance to build a comprehensive view of the TCO over the five-year lifetime of the data product.</p>
<p>The analysis indicates a potential 10X return on investment - far exceeding the typical threshold for project approval. The CFO is visibly excited as you review the numbers together.</p>
<p>"This is exactly the kind of focused, high-impact initiative we need," she says. "I'm confident the board will approve this. Let's finalize the presentation."</p>
<h2 id="the-board-presentation-moment-of-truth">The Board Presentation: Moment of Truth</h2>
<p>A week after the initial board meeting, you return to present your proposal. Despite having slept only a few hours the night before (finalizing the presentation took longer than expected), you feel confident.</p>
<p>The Data Product Canvas has helped you create a comprehensive, well-thought-out proposal in just one week. You've identified a clear business need, designed a targeted solution, assessed technical feasibility, and quantified the potential value.</p>
<p>As you walk the board through the proposal, you can see heads nodding. The CFO adds her strong endorsement, emphasizing how the solution directly addresses their concerns about revenue performance and customer satisfaction.</p>
<p>When you acknowledge the experimental nature of the ML component and the "learning budget" approach, the board appreciates your transparency. "This is a refreshing change," the CEO comments. "Usually, tech projects promise the moon and then fail to deliver. I like this pragmatic approach."</p>
<p>After some thoughtful questions, the board unanimously approves the initiative. As you leave the boardroom, you feel a mixture of elation and the weight of responsibility. You've secured approval and set appropriate expectations - now comes the hard part of delivering.</p>
<h2 id="reflecting-on-the-journey">Reflecting on the Journey</h2>
<p>Back in your office, you review the completed canvas once more. In just one week, it's helped you transform a broad strategic mandate into a specific, actionable data product with clear business value.</p>
<p>What strikes you most is how the canvas helped balance business and technical considerations. Starting with the audience and their needs kept the solution focused on delivering real value, while the systematic exploration of technical requirements ensured you weren't promising something impossible.</p>
<p>You also appreciate how the canvas helped you identify and address potential adoption challenges upfront. By involving garden center managers early and understanding their needs and constraints, you've designed a solution they're more likely to embrace.</p>
<p>Most importantly, the canvas provided a structured way to communicate your thinking to the board and secure their support. The comprehensive nature of the analysis gave them confidence that you'd considered all key aspects of the initiative.</p>
<p>As you send the approved canvas to your team to begin implementation planning and detailed design, you reflect on how different this feels from previous technology initiatives you've been involved with. Instead of starting with technology and hoping it creates value, you've started with value and found the right technology to deliver it.</p>
<h2 id="conclusion">Conclusion</h2>
<p>The Data Product Canvas transformed what could have been a stressful week of scrambling to develop proposals into a structured process that produced a compelling, well-thought-out data product concept. By systematically working through each building block, you were able to:</p>
<ul>
<li>Deeply understand user needs and constraints</li>
<li>Identify a focused, high-impact opportunity</li>
<li>Design a solution that balanced ambition with feasibility</li>
<li>Anticipate and address potential adoption challenges</li>
<li>Build a compelling business case with quantified value</li>
</ul>
<p>Most importantly, the canvas helped you navigate the socio-technical complexity of data initiatives. It wasn't just about data and algorithms - it was about people, processes, and organizational dynamics.</p>
<p>This experience reinforces why a purpose-driven, holistic approach to data products is essential. By considering all aspects - from user needs to technical implementation to organizational change - you've dramatically increased the likelihood of delivering real business impact.</p>
<p>For anyone in a data leadership role facing similar challenges, the Data Product Canvas provides a structured path from vague strategic directives to concrete, valuable data products. It helps you ask the right questions, involve the right stakeholders, and build solutions that deliver tangible business value.</p>
<aside class="series-toc" aria-label="Series table of contents">
    <div class="series-toc__header">
        <h3 class="series-toc__title">Data Product Canvas</h3>
        <span class="series-toc__count">4 parts</span>
    </div>
    <ol class="series-toc__list">
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/the-data-product-canvas-stop-building-products-that-fail" class="series-toc__link">
                    <span class="series-toc__part-number">1.</span>
                    <span class="series-toc__part-title">Introducing The Data Product Canvas</span>
                </a>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/the-data-product-canvas-deep-dive-into-building-blocks" class="series-toc__link">
                    <span class="series-toc__part-number">2.</span>
                    <span class="series-toc__part-title">Deep Dive into the Building Blocks</span>
                </a>
            </li>
            <li class="series-toc__item series-toc__item--current" aria-current="page">
                <span class="series-toc__part-number">3.</span>
                <span class="series-toc__part-title">The Canvas in Action</span>
                <span class="series-toc__current-label">(you are here)</span>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/the-data-product-canvas-theory-behind-the-canvas" class="series-toc__link">
                    <span class="series-toc__part-number">4.</span>
                    <span class="series-toc__part-title">The Theory Behind The Canvas</span>
                </a>
            </li>
    </ol>
</aside>]]></content:encoded>
    </item>
    <item>
      <title>The Data Product Canvas: Deep Dive into the Building Blocks</title>
      <description>The Data Product Canvas has nine building blocks, best completed right-to-left starting with Audience and Actionable Insight, to keep data products purpose-driven and user-centred.</description>
      <link>https://endjin.com/blog/the-data-product-canvas-deep-dive-into-building-blocks</link>
      <guid isPermaLink="true">https://endjin.com/blog/the-data-product-canvas-deep-dive-into-building-blocks</guid>
      <pubDate>Mon, 20 Oct 2025 05:30:00 GMT</pubDate>
      <category>Data Product Canvas</category>
      <category>Business Model Generation</category>
      <category>Data Mesh</category>
      <category>Data Product</category>
      <category>Value Proposition</category>
      <category>User Centred Design</category>
      <category>Pivot</category>
      <category>Fail Fast</category>
      <category>Purpose Driven Design</category>
      <category>Data</category>
      <category>Analytics</category>
      <category>Microsoft Fabric</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/05/the-data-product-canvas-deep-dive-into-building-blocks.png" />
      <dc:creator>Barry Smart</dc:creator>
      <content:encoded><![CDATA[<p>TLDR; The Data Product Canvas provides a structured approach to designing high-value, sustainable data products. This strategic tool brings stakeholders together to align business needs with technical capabilities before committing resources. This deep dive explores each of the nine building blocks, showing you how to complete them effectively to maximize your chances of success.</p>
<p>In <a href="https://endjin.com/blog/the-data-product-canvas-stop-building-products-that-fail">Part 1</a> of the blog, we introduced the Data Product Canvas as a blueprint for success - a simple yet powerful tool that aims to bring stakeholders together to shape data products that deliver real value.</p>
<p>It does this by presenting 9 simple building blocks that will encourage you to think holistically about the data product.</p>
<ul>
<li><strong>Audience</strong> - the specific groups of people that the data product is aiming to create value for.</li>
<li><strong>Actionable Insight</strong> - the data driven actionable intelligence that will be delivered by the data product to allow the audience to achieve a specific goal.</li>
<li><strong>Consumption</strong> - the means through which the audience will access and use the data product.</li>
<li><strong>Adoption</strong> - the support that will be given to the audience to enable them to successfully discover and use the data product.</li>
<li><strong>Lifetime Value</strong> - the value (tangible and intangible) that the data product is aiming to deliver over its lifetime.</li>
<li><strong>Data Sources</strong> - the data sources which are required to deliver the actionable insight.</li>
<li><strong>Data Processing</strong> - the actions that will need to be taken to transform data sources into the actionable insight.</li>
<li><strong>Data Skills, Tools and Methods</strong> - the key capabilities that will be required to deliver and sustain the data product over its lifetime.</li>
<li><strong>Total Cost of Ownership</strong> - the projected costs to design, build, test, operate, maintain and evolve the data product over its lifetime.</li>
</ul>
<p>Here is the blank canvas:</p>
<p><a target="_blank" href="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/01/endjin-data-product-canvas.png"><img src="https://endjin.com/assets/images/blog/2025/01/endjin-data-product-canvas.png" alt="Image of the Data Product Canvas developed by endjin inspired by Business Model Generation and Data Mesh" title="Image of the Data Product Canvas"></a></p>
<p>If you would like an Adobe Acrobat version of the canvas, please <a href="https://endjin.com/contact-us/">reach out to us</a>.</p>
<p>In this blog, we describe each of the blocks in more detail <strong>in the general order that we recommend you should complete the canvas</strong>, providing reference material that we hope will help you to adopt the canvas successfully.</p>
<h2 id="in-which-order-should-you-complete-the-canvas">In which order should you complete the canvas?</h2>
<p>We recommend <strong>starting with the centre of the model</strong> to capture the <strong>actionable insight</strong>.  This ensures that you start with "Why?" and adopt a <strong>purpose driven approach</strong> to designing your data product.</p>
<p>Having captured the actionable insight, we recommend you then focus on the <strong>right-hand side</strong> of the canvas to reinforce the principle that successful products apply <strong>user-centred design</strong> through a deep understanding of the audience who will use the product.</p>
<p>Once you have validated the fit between actionable insight and audience, you should then complete the <strong>left-hand side</strong> of the model to <strong>test the feasibility</strong> by validating that the data, processes and wider capabilities are in place to enable the actionable insight to be delivered.</p>
<p>This approach is illustrated below:</p>
<p><a target="_blank" href="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/07/endjin-data-product-canvas-order.png"><img src="https://endjin.com/assets/images/blog/2025/07/endjin-data-product-canvas-order.png" alt="Image of the Data Product Canvas overlaid with order in which we recommend it should be completed." title="Image of the Data Product Canvas overlaid with order in which we recommend it should be completed."></a></p>
<p>Note - the process of completing the canvas is seldom linear.  You will tend to loop back and iterate on all parts of the canvas as you uncover new information and develop your understanding.  So don't be constrained to the flow illustrated above!</p>
<h3 id="actionable-insight">Actionable Insight</h3>
<p>Start with <strong>Why?</strong> by <a href="https://endjin.com/blog/insight-discovery-03-defining-actionable-insights">defining the actionable insight</a>.</p>
<p>An actionable insight is a "quantum" of functionality. It should have a narrow focus based on:</p>
<ul>
<li>Goal - what goal is the audience seeking to achieve?</li>
<li>Question - what question needs to be answered that will enable them to achieve this goal?</li>
<li>Insight - what information is required to answer that question?</li>
<li>Action - what action will the insight enable the audience to take that will contribute to the goal?</li>
</ul>
<p>See some examples below:</p>
<table>
<thead>
<tr>
<th>Goal</th>
<th>Question</th>
<th>Insight</th>
<th>Action</th>
</tr>
</thead>
<tbody>
<tr>
<td>Protect revenue</td>
<td>Which clients are we at risk of losing?</td>
<td>Top 10 clients at risk</td>
<td>Proactive intervention to engage with clients who are deemed to be at risk</td>
</tr>
<tr>
<td>Maximise revenue, increase customer satisfaction</td>
<td>What products should we be stocking to maximise sales next month?</td>
<td>Top 10 products in high demand next month</td>
<td>Proactively order stock from suppliers to meet demand for coming month</td>
</tr>
<tr>
<td>Reduce customer churn, achieve SLAs</td>
<td>What staffing levels do I need tomorrow?</td>
<td>Predicted demand and recommended staffing level for the next shift</td>
<td>Authorise overtime tomorrow to proactively resource up for an anticipated spike in demand</td>
</tr>
</tbody>
</table>
<p>Mapping out these elements will ensure the data product will:</p>
<ul>
<li>Deliver insights that can be acted upon</li>
<li>Contribute directly to a goal of the organisation</li>
<li>Be measurable in terms of value and impact</li>
</ul>
<p>The clearer this understanding, the more likely the data product will succeed.  It will help you to avoid "scope creep" and keep all stakeholders focused on the goal.</p>
<p>You will often need to do some detective work to uncover the goal and actionable insight.  Users tend to communicate the "solution" rather than the "requirement", so be prepared to ask "Why?" a few times to get to the raw requirements, for example:</p>
<ul>
<li>End user: "We need a new stock dashboard."</li>
<li>You: "Why do you need this report?"</li>
<li>End user: "So we can see the level of stock we have for each product at each store."</li>
<li>You: "Why do you need to understand stock levels at this level of detail?"</li>
<li>End user: "We use this information to understand where stock levels are running low relative to anticipated demand." (the <strong>INSIGHT</strong>)</li>
<li>You: "Why do you need to understand when stock levels are running low?"</li>
<li>End user: "So we can proactively order stock from suppliers and prevent the shelves from going empty." (the <strong>ACTION</strong>)</li>
<li>You: "Why is this important?"</li>
<li>End user: "It is critical to maximising revenue and to keep our customers happy and returning to shop with us!" (the <strong>GOAL</strong>)</li>
</ul>
<p>Other aspects of the actionable insight that you should also consider at this stage:</p>
<ul>
<li>Related data products - other data products that this actionable insight could leverage. This is typically an upstream data product which it can consume as an input.</li>
<li>Service levels - define the service level objectives that need to be achieved for the actionable insight to be trustworthy. Typically, this would include considerations such as:
<ul>
<li>Accuracy</li>
<li>Completeness</li>
<li>Availability</li>
</ul>
</li>
<li>Risks - the risks that you have identified that could manifest to prevent the actionable insight from being delivered in a sustainable and reliable way.</li>
</ul>
<h3 id="audience">Audience</h3>
<p>In this building block you identify the audience that is responsible for putting the data product into use to generate value.</p>
<p>Each action in the examples provided in the table above should be something the audience has the authority and capability to execute. For instance:</p>
<ul>
<li>Customer service managers can authorize overtime</li>
<li>Procurement teams can order stock</li>
<li>Account managers can intervene with at-risk clients</li>
</ul>
<p>If the insight suggests actions that the audience can't take, it's not truly "actionable" for them. This connects back to understanding the audience's "key activities" - what they're actually empowered to do in their role.</p>
<div class="aside"><p>If your audience can't take action based on the insight you provide, you haven't created an actionable insight—you've created an interesting observation.</p>
</div>
<p>A good product is founded on a deep understanding of end users and designing it with their needs in mind. The most successful products are loved by end users because they generate value for them by allowing them to overcome a specific "pain" or achieve some kind of "gain".</p>
<div class="aside"><p>Without a clear understanding of your audience, even the most technically impressive data product will struggle to find adoption.</p>
</div>
<p>This building block encourages us to understand the end users, their existing methods, and how they will prefer to consume the data product to achieve a specific goal.  Here we identify the "personas" who will consume the data product. By persona we mean a specific person (e.g. the Chief Financial Officer), role (e.g. Customer Service Agents) or demographic group (e.g. University Students).</p>
<p>For each persona we should explore:</p>
<ul>
<li>Responsibilities - what business outcomes are they responsible for and how is their success measured?</li>
<li>Key Activities - what actions are they authorised to take in order to fulfil their responsibilities? What resources do they have control over in terms of budget, team etc.?</li>
<li>Inhibitors - what gets in the way of them achieving their responsibilities? This could be information gaps, decision-making bottlenecks, compliance challenges and quality issues.</li>
</ul>
<p>We should also consider factors that will have a direct impact on the design of the data product such as:</p>
<ul>
<li>Level of data literacy and technical skills when it comes to embracing the data product</li>
<li>Existing tools that they use</li>
<li>Time constraints</li>
<li>Existing work patterns</li>
</ul>
<p>By understanding your audience, you will ensure that the data product will:</p>
<ul>
<li>Deliver an insight the audience can actually act on</li>
<li>Fit naturally with existing skills and workflow</li>
<li>Overcome barriers to adoption</li>
</ul>
<p>The clearer the understanding you build, the more likely the data product will achieve adoption and generate value.</p>
<h3 id="consumption">Consumption</h3>
<p>Define how the audience will access and consume the actionable insight:</p>
<ul>
<li>Nature of the actionable insight - in other words, the delivery mechanism or means through which the actionable insight is consumed: report, dashboard, alert, document, tabular data, data API, interactive model, knowledge graph etc.</li>
<li>Device - the type of device they will access the actionable insight on. This can have significant implications for the data product. This could be a personal device, or some kind of downstream platform which is seeking to consume the data product.</li>
<li>When? - how often will they actually need to consume the data product - hourly, daily, weekly, monthly? How up to date will the actionable insight need to be to be useful? How will they be notified when new insights are available or have been refreshed?</li>
<li>Security &amp; permissions - is concerned with authentication and authorization controls. How will users (or machines) authenticate? What information should they see / permissions should they have depending on their role?</li>
</ul>
<div class="aside"><p>The most sophisticated analysis is worthless if delivered in a format users can't or won't consume. Meet users where they are, not where you wish they were.</p>
</div>
<h3 id="adoption">Adoption</h3>
<p>In this area of the canvas we consider how end users will find, gain access to and successfully adopt the data product:</p>
<ul>
<li>Discovery - explains how users find, learn about and access the data product. You should also consider how established users will be notified when new features (versions) have been developed.</li>
<li>Design - a topic that is often overlooked and can have a significant impact on the successful adoption of a data product. This is specific to the nature of the data product. Reports will be subject to design considerations such as accessibility for visual impairments and branding, whilst APIs will be subject to considerations such as the API style (GraphQL versus OpenAPI) and the availability of a "try it out" web site and code samples.</li>
<li>Documentation, Support &amp; Training - describe the documentation, training and support will be available. You should also consider who should they turn to when they have a question or encounter an issue. Documentation should include information about data quality (data contract) and lineage.</li>
<li>Feedback loop - describe how end users will provide feedback about their experience using the data product. Also consider how usage and adoption be measured over time.</li>
</ul>
<h3 id="lifetime-value">Lifetime Value</h3>
<p>Quantifies the business value that the data product will generate over its lifetime. The lifetime of a data product is typically at least 5 years.</p>
<p>Examples include tangible value such as:</p>
<ul>
<li>Revenue growth</li>
<li>Cost savings</li>
<li>Productivity</li>
</ul>
<p>But also more intangible sources of value to the organisation such as:</p>
<ul>
<li>Risk reduction</li>
<li>Strategic advantages</li>
<li>Protecting brand / reputation</li>
<li>Enhancing customer experience</li>
<li>Employee satisfaction and retention</li>
</ul>
<div class="aside"><p>The most successful data products often start by delivering small but immediate value, then expand their impact over time as adoption grows and capabilities evolve.</p>
</div>
<p>Tailoring the value drivers in this section to the goals of the organisation is a useful way of ensuring that data product ideas are tested against and aligned to the business strategy.</p>
<h3 id="data-sources">Data Sources</h3>
<p>Identifies the specific data sources required to deliver the actionable insights. Here, you should consider:</p>
<ul>
<li>Nature of the data source - which will typically be:
<ul>
<li>Business application in the operational plane - an application that is used to run the business such as an ERP platform?</li>
<li>Service - is it an external service that the organisation has access to such as Bloomberg?</li>
<li>Master and reference data - is it some form of master / reference data that is typically required to augment and integrate operational data?</li>
<li>Another data product?</li>
</ul>
</li>
<li>Trust - what level of trust do you place in each data source? Data sources with a low level of trust should immediately be a red flag for any data product idea!</li>
<li>Classification - how is the data classified? Is it information that is considered low value / low risk? Or is it high value and / or high risk in nature such as proprietary information about individuals that only your organisation has access to?</li>
<li>Compliance - are there specific policies, regulations and laws that need to be applied? For example GDPR?</li>
<li>Ownership of the data - does the organisation own the data or is it owned by an external party? If internal, which department owns it? Are they prepared to grant access?</li>
<li>Data Characteristics - consider the general nature of the information for each (here we use the "5Vs" framework):
<ul>
<li>Volume - the amount of data and how it will grow over time.</li>
<li>Velocity - the speed at which data changes and needs to be processed.</li>
<li>Variety - the range of data types that need to be consumed: structured, unstructured and semi-structured.</li>
<li>Veracity - the quality and accuracy of the data.</li>
<li>Value - licenses required to use the data.</li>
</ul>
</li>
</ul>
<p>In all of the above, you are not looking for precise or detailed answers. It's a case of using your expertise to highlight the characteristics that will have a significant influence on the technical complexity, feasibility and overall TCO of the data product. The single page canvas, with limited space, will force you to keep it focused on the most important characteristics.</p>
<h3 id="data-processing">Data Processing</h3>
<p>Defines the key activities that need to be performed to transform the data sources into an actionable insight. This includes the following activities and considerations:</p>
<ul>
<li>Ingestion - how will data be ingested from source?</li>
<li>Quality Assurance - does data need to be validated before it can be used? What "data contract" needs to be fulfilled to enable the data product to be viable?</li>
<li>Processing - what processing is required? Cleaning, standardisation, integration, transformation, filtering?</li>
<li>Projection - in what form should the data be presented for downstream consumption? For example, does it need to be transformed into a single flat table, a star schema or a knowledge graph?</li>
<li>Modelling - what form of analytics or modelling processes need to be applied to source data? For example, is data inference being applied to address gaps in data? Is a machine learning model being applied to cluster data or enable predictive analytics?</li>
<li>Automation - what level of automation is required?</li>
<li>Triggers - when / how often should the process run?</li>
</ul>
<div class="aside"><p>Data processing requirements should be driven by the needs of the actionable insight, not by the capabilities of your existing tools or team preferences.</p>
</div>
<p>As above, the purpose here is not to define the solution in detail but to highlight the major processes that need to be applied to deliver the data product. In future phases, if the data product is taken forward into implementation, more detailed analysis and design may decompose this data product idea down into a number of inter-linked data products (a data mesh). But for now, the key is to identify the major activities. Focus on highlighting activities that will require specific skills, methods and technology platforms to deliver.</p>
<h3 id="data-skills-tools-and-methods">Data Skills, Tools and Methods</h3>
<p>Outlines the people, technologies and processes required to enable the data processing (defined above) and to sustain the data product. This will include:</p>
<ul>
<li>Expertise - knowledge, technical skills and other domain related skills required to build, own and operate the data product.</li>
<li>Tools - the specific platforms, technologies, packages and libraries that will be required to support data processing.</li>
<li>DataOps - platforms, tools and practices related to addressing the non-functional requirements associated with the data product such as testing, observability and source control.</li>
<li>Governance - which policies and principles are relevant to this data product? This is a significant topic in its own right, but it is important to highlight anything that is critical or unique in some way to this data product. Examples include:
<ul>
<li>Specific regulations that apply.</li>
<li>The classification of the data - for example is it highly sensitive personal information? This will fundamentally shape the design of the solution and inflate the TCO!</li>
<li>Considerations around data retention.</li>
<li>Business continuity and disaster recovery requirements.</li>
</ul>
</li>
<li>Standards - identify the specific standards that should be adopted by the data product, specifically to enable it to support key features of a data product:
<ul>
<li>Re-use by making it addressable and accessible</li>
<li>Inter-operability with other data products</li>
<li>Versioning</li>
<li>Publishing for discovery</li>
</ul>
</li>
</ul>
<p>As your organisational maturity develops, you should increasingly find that the skills, tools and methods identified will be well established and available as "re-usable IP" that is a native part of the data platform(s) on which you build and operate data products. Where a new skill, tool or method is required, this is likely to present additional cost and risk to the data product.</p>
<h3 id="total-cost-of-ownership-tco">Total Cost of Ownership (TCO)</h3>
<p>Captures all anticipated costs associated with the data product over its entire lifetime. In our experience, data products are likely to have a lifetime of at least 5 years and will incur significant costs over this time. Some of this cost will be direct and tangible, other costs will be less tangible, but just as important to capture.</p>
<div class="aside"><p>For every £/$/€ spent building a data product, organizations typically go on to spend 5X of that maintaining it over its lifetime. Comprehensive TCO analysis prevents painful surprises.</p>
</div>
<p>For a deeper dive into TCO, we recommend our series <a href="https://endjin.com/blog/what-is-total-cost-of-ownership-why-is-it-important">What is total cost of ownership and why is it important?</a></p>
<p>Many data teams out there are weighed down by the effort required to sustain legacy data products that are not delivering sufficient value, consuming the bandwidth that could be used to build, own and operate high impact data products. If they had used a tool such as the data product canvas, or retrospectively applied it to identify data products that should now be retired, they could be maximising their contribution to the organisation.</p>
<p>Total cost of ownership is a broad topic but should consider:</p>
<ul>
<li>Infrastructure and storage costs</li>
<li>Build costs whether the costs are internal resources or outsourced</li>
<li>Support, maintenance and operations, with a view to staying on top of technical debt</li>
<li>Ongoing evolution to add new features</li>
<li>License fees</li>
<li>Data costs - such as paying for external data services</li>
<li>Training and documentation, including bandwidth necessary to keep up with the ever evolving cloud data platform landscape</li>
</ul>
<h2 id="strategic-principles-for-success">Strategic Principles for Success</h2>
<p>Now that we've covered the nine building blocks in detail, let's examine some key strategic principles that will help you get the most value from using the canvas in practice.</p>
<h3 id="keep-scope-as-narrow-as-possible">Keep scope as narrow as possible</h3>
<p>By design, the Data Product Canvas presents you with a small surface area to force you to focus the data product on a specific goal, audience and actionable insight.</p>
<p>By keeping the data product as targeted and as narrow in scope as possible, you are more likely to deliver it. It helps you to avoid the trap of monolithic solutions, "analysis paralysis" and multi-year waterfall projects that promise the earth, are often late to deliver and fail to live up to expectations.</p>
<div class="aside"><p>If you can't fit your idea on a single canvas, it's too big. Break it down into smaller, more targeted data products that can be delivered incrementally.</p>
</div>
<p>Establishing <strong>clarity of purpose</strong> is fundamental in product design as it:</p>
<ul>
<li>Promotes alignment and effective decision-making within the team responsible for building the product, helping you to avoid scope creep and unwanted features</li>
<li>Helps with "marketing" the product through clear communication, allowing users to grasp what the product is for</li>
<li>Protects the longevity of the product by simplifying the strategy and road map</li>
</ul>
<p>If you can't fit your idea on the canvas it's too big. Use the canvas to simplify your initial idea down to the quantum of functionality that will have maximum impact with the least amount of complexity and risk.</p>
<h3 id="tackle-data-products-as-a-socio-technical-endeavour">Tackle Data Products as a socio-technical endeavour</h3>
<p>To succeed in extracting value from data, you need to master the cultural, organisational and human aspects as well as the technology. We often talk about <strong>people, process and technology</strong>. As data professionals, we often pay too much attention to the technology part. The data product canvas will help give you early sight of the people, process and technology barriers that may act against the goal(s) you are trying to achieve.</p>
<p>The key to winning hearts and minds is to involve all relevant stakeholders early in the lifecycle, to collaboratively build shared situational awareness, to be transparent about the people and process aspects that are key to success, and to get commitment from all to the transformation. The Data Product Canvas is a tool which supports this approach.</p>
<p>The latest wave of Generative AI is particularly susceptible to these socio-technical challenges. We have seen this in our work helping clients embrace Generative AI products and services: only the teams that are willing to develop new skills and embrace new ways of working are able to generate value from this exciting new technology.</p>
<p>We find the level of resistance is often in proportion to the scale of transformation required to successfully adopt the new technology. This resistance is often driven by misaligned incentives. To quote Upton Sinclair:</p>
<blockquote>
<p>"It is difficult to get a person to understand something, when their salary depends upon them not understanding it."</p>
</blockquote>
<h3 id="elevator-pitch">Elevator pitch</h3>
<p>The Data Product Canvas provides all of the information you need for your <a href="https://en.wikipedia.org/wiki/Elevator_pitch">elevator pitch</a>. It covers all of the fundamentals: the purpose (what problem is it solving?), who its target users are and the value it is seeking to generate.</p>
<p>It is extremely useful when you are required to "sell" the idea to senior stakeholders who may need to be convinced to commit budget and / or resources to it. It is the mental model you can use to describe the data product. If you do choose to share it with the stakeholder, we wouldn't recommend walking through it line by line. Describe the principle of the canvas and allow the stakeholder to then explore it and ask questions. You may be surprised to find that they have already come across the Business Model Canvas or simply "grok it".</p>
<h2 id="conclusion">Conclusion</h2>
<p>The Data Product Canvas transforms how organizations approach data products by shifting focus from technology-first to purpose-driven design. By working through these nine building blocks in a collaborative, iterative process, teams can identify winning opportunities, eliminate costly missteps, and ensure alignment between business value and technical implementation. This upfront investment of time saves organizations substantial resources by validating ideas before committing to development. Start using the canvas today to dramatically improve how you conceptualize, communicate, and deliver data products.</p>
<aside class="series-toc" aria-label="Series table of contents">
    <div class="series-toc__header">
        <h3 class="series-toc__title">Data Product Canvas</h3>
        <span class="series-toc__count">4 parts</span>
    </div>
    <ol class="series-toc__list">
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/the-data-product-canvas-stop-building-products-that-fail" class="series-toc__link">
                    <span class="series-toc__part-number">1.</span>
                    <span class="series-toc__part-title">Introducing The Data Product Canvas</span>
                </a>
            </li>
            <li class="series-toc__item series-toc__item--current" aria-current="page">
                <span class="series-toc__part-number">2.</span>
                <span class="series-toc__part-title">Deep Dive into the Building Blocks</span>
                <span class="series-toc__current-label">(you are here)</span>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/the-data-product-canvas-in-action" class="series-toc__link">
                    <span class="series-toc__part-number">3.</span>
                    <span class="series-toc__part-title">The Canvas in Action</span>
                </a>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/the-data-product-canvas-theory-behind-the-canvas" class="series-toc__link">
                    <span class="series-toc__part-number">4.</span>
                    <span class="series-toc__part-title">The Theory Behind The Canvas</span>
                </a>
            </li>
    </ol>
</aside>]]></content:encoded>
    </item>
    <item>
      <title>The Data Product Canvas: Stop Building Data Products That Fail</title>
      <description>Turn data initiatives into business success stories with the Data Product Canvas. This practical framework helps teams design data products that deliver real value, avoid common pitfalls, and align with business objectives.</description>
      <link>https://endjin.com/blog/the-data-product-canvas-stop-building-products-that-fail</link>
      <guid isPermaLink="true">https://endjin.com/blog/the-data-product-canvas-stop-building-products-that-fail</guid>
      <pubDate>Fri, 17 Oct 2025 05:30:00 GMT</pubDate>
      <category>Data Product Canvas</category>
      <category>Business Model Generation</category>
      <category>Data Mesh</category>
      <category>Data Product</category>
      <category>Value Proposition</category>
      <category>User Centred Design</category>
      <category>Pivot</category>
      <category>Fail Fast</category>
      <category>Purpose Driven Design</category>
      <category>Data</category>
      <category>Analytics</category>
      <category>Microsoft Fabric</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/05/the-data-product-canvas-stop-building-products-that-fail.png" />
      <dc:creator>Barry Smart</dc:creator>
      <content:encoded><![CDATA[<p>TLDR; The Data Product Canvas adapts the Business Model Canvas and Data Mesh frameworks to help teams collaboratively envisage, evaluate and iterate on a data product idea before committing significant resources to it.</p>
<p>Data products often fail not due to technical limitations, but because they don't solve real business problems or aren't operationally sustainable.</p>
<p>The symptoms are:</p>
<ul>
<li><strong>Misalignment</strong> - solutions looking for problems rather than addressing real pain points, analytics that don't drive actual decisions or actions, insights that arrive too late to be valuable.</li>
<li><strong>High costs, low value</strong> - high maintenance costs for low-value solutions, expensive licenses for underutilized tools.</li>
<li><strong>Strategic stagnation</strong> - businesses that are unable to respond quickly to market changes, unable to capitalize on AI/ML advances and falling behind the competition.</li>
<li><strong>Lack of engagement</strong> - users reverting to old ways of working, lack of skills and incentives to effectively use new tools, unclear ownership and accountability.</li>
</ul>
<p>In our experience, success is most often driven by the following factors, listed here in descending order of impact:</p>
<ol>
<li><strong>Clear Value Proposition</strong> - data products with a clear purpose.</li>
<li><strong>Quality Data</strong> - solid data foundations with active stewardship.</li>
<li><strong>Leadership</strong> - who set the vision, drive the roadmap and inspire change.</li>
<li><strong>Stakeholder Alignment</strong> - key decision-makers and influencers are united behind the vision.</li>
<li><strong>Champions</strong> - who lead by example when it comes to adoption of new tools and ways of working.</li>
<li><strong>Change Support</strong> - training and process redesign smooths the path from old to new ways of working.</li>
<li><strong>User-First Design</strong> - products that are accessible to all, simple to master, and impossible to live without.</li>
<li><strong>ROI Impact</strong> - a clear understanding of the lifetime costs relative to lifetime value.</li>
<li><strong>Technical Expertise</strong> - access to the skills and knowledge required to design, build and sustain a data product over its lifetime.</li>
<li><strong>Technical Feasibility</strong> - access to the technologies required to develop the product.</li>
</ol>
<p>The path to success is clear: start with purpose, build on quality data, and create an environment where users thrive. Get these right, and technical implementation becomes the easier part of the journey.</p>
<h2 id="introducing-the-data-product-canvas">Introducing the data product canvas</h2>
<p>The Data Product Canvas is a blueprint for success - a simple yet powerful tool that aims to bring all of the stakeholders together to shape data products that deliver real value.</p>
<p>It does this by presenting 9 simple building blocks that will encourage you to think holistically about the data product.</p>
<ul>
<li><strong>Audience</strong> - the specific groups of people that the data product is aiming to create value for.</li>
<li><strong>Actionable Insight</strong> - the data driven actionable intelligence that will be delivered by the data product to allow the audience to achieve a specific goal.</li>
<li><strong>Consumption</strong> - the means through which the audience will access and use the data product.</li>
<li><strong>Adoption</strong> - the support that will be given to the audience to enable them to successfully discover and use the data product.</li>
<li><strong>Lifetime Value</strong> - the value (tangible and intangible) that the data product is aiming to deliver over its lifetime.</li>
<li><strong>Data Sources</strong> - the data sources which are required to deliver the actionable insight.</li>
<li><strong>Data Processing</strong> - the actions that will need to be taken to transform data sources into the actionable insight.</li>
<li><strong>Data Skills, Tools and Methods</strong> - the key capabilities that will be required to deliver and sustain the data product over its lifetime.</li>
<li><strong>Total Cost of Ownership</strong> - the projected costs to design, build, test, operate, maintain and evolve the data product over its lifetime.</li>
</ul>
<p><a target="_blank" href="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/01/endjin-data-product-canvas.png"><img src="https://endjin.com/assets/images/blog/2025/01/endjin-data-product-canvas.png" alt="Image of the Data Product Canvas developed by endjin inspired by Business Model Generation and Data Mesh" title="Image of the Data Product Canvas"></a></p>
<p>If you would like an Adobe Acrobat version of the canvas, please <a href="https://endjin.com/contact-us/">reach out to us</a>.</p>
<p>The Data Product Canvas draws inspiration from two areas of thought leadership:</p>
<ul>
<li><a href="https://www.strategyzer.com/library/the-business-model-canvas">Business Model Canvas</a> - a visual tool for designing, exploring and iterating on new strategies and product ideas.</li>
<li><a href="https://endjin.com/blog/microsoft-fabric-is-a-socio-technical-endeavour">Data Mesh</a> - specifically the concept of "data as a product", which is one of the 4 key principles applied in a Data Mesh architecture.</li>
</ul>
<h2 id="feedback-loops">Feedback Loops</h2>
<p>The super power of this canvas is that it enables positive feedback loops: one of the unsung strengths of the Agile approach.</p>
<p>The canvas is primarily intended for use early in the lifecycle before any effort or budget is consumed on implementing the data product.  It achieves this by holding a mirror up to those who are considering the idea, with the objective of allowing them to dispassionately evaluate the feasibility and to highlight in advance the key barriers they will need to overcome to make the data product a success.</p>
<p>The process of completing the canvas is iterative in nature.  It will require you to conduct short conversations with different stakeholders, with each conversation allowing you to illuminate a specific area of the canvas, uncover new information or raise questions that require you to loop back and re-consider areas of the canvas you may already have populated.</p>
<p>At some point in completing the canvas, you may need to fundamentally re-think the idea (a practice often referred to as a "pivot") or decide that the idea simply isn't feasible (i.e. to "fail fast").  Either is a positive outcome, as it is better to do this whilst the idea is still on paper rather than further down the lifecycle when significant budget has been expended!</p>
<div class="aside"><p>The cheapest thing you can do it talk, the most expensive thing you can do is code!</p>
</div>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/01/data-product-lifecycle-pivot-fail-fast.png" alt="The product lifecycle showing the idea, explore, validate, grow, sustain and retire phases" srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2025/01/data-product-lifecycle-pivot-fail-fast.png 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2025/01/data-product-lifecycle-pivot-fail-fast.png 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2025/01/data-product-lifecycle-pivot-fail-fast.png 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2025/01/data-product-lifecycle-pivot-fail-fast.png 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<h2 id="looking-ahead-making-data-products-work">Looking Ahead: Making Data Products Work</h2>
<p>The technology landscape has transformed dramatically in the last decade.  The pandemic forced many organisations to make decisions from data rather than from gut feel and face-to-face interactions.  The more recent emergence of Generative AI has sparked fresh enthusiasm for data-driven transformation.</p>
<p>The key challenge is choosing the right opportunities and understanding what is necessary to implement them successfully as a <a href="https://endjin.com/blog/data-is-a-socio-technical-endeavour">socio-technical endeavour</a>.  Organisations face several critical questions:</p>
<ul>
<li>How do we prioritize competing ideas for data products?</li>
<li>Where will our investments have the most significant impact?</li>
<li>How do we ensure successful implementation?</li>
<li>How do we align stakeholders and resources effectively?</li>
<li>Do we have the necessary skills and knowledge to deliver this on time, within budget?</li>
</ul>
<p>The Data Product Canvas offers a framework to answer these questions. By encouraging you to take a collaborative approach to envisioning, evaluating, and iterating on data product ideas, it helps teams:</p>
<ul>
<li>Identify high-impact opportunities</li>
<li>Anticipate implementation challenges early</li>
<li>Consider cultural and organizational factors</li>
<li>Build stakeholder alignment</li>
<li>Make informed investment decisions</li>
</ul>
<p>Organizations today face a critical disconnect: soaring ambitions for AI and data transformation, but earthbound budgets and resources to achieve them. It's the classic "champagne ambitions versus sparkling water budget" dilemma, but with high-stakes consequences for business competitiveness.</p>
<p>Without a structured approach to evaluating data initiatives, organisations risk wasting limited time, resources and budget on failed data products.  The Data Product Canvas is a useful tool for preventing expensive mistakes and identifying winners before you invest. In an era where every failed data project means lost market share and competitive advantage, organizations can't afford the traditional "build it and they will come" approach.</p>
<aside class="series-toc" aria-label="Series table of contents">
    <div class="series-toc__header">
        <h3 class="series-toc__title">Data Product Canvas</h3>
        <span class="series-toc__count">4 parts</span>
    </div>
    <ol class="series-toc__list">
            <li class="series-toc__item series-toc__item--current" aria-current="page">
                <span class="series-toc__part-number">1.</span>
                <span class="series-toc__part-title">Introducing The Data Product Canvas</span>
                <span class="series-toc__current-label">(you are here)</span>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/the-data-product-canvas-deep-dive-into-building-blocks" class="series-toc__link">
                    <span class="series-toc__part-number">2.</span>
                    <span class="series-toc__part-title">Deep Dive into the Building Blocks</span>
                </a>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/the-data-product-canvas-in-action" class="series-toc__link">
                    <span class="series-toc__part-number">3.</span>
                    <span class="series-toc__part-title">The Canvas in Action</span>
                </a>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/the-data-product-canvas-theory-behind-the-canvas" class="series-toc__link">
                    <span class="series-toc__part-number">4.</span>
                    <span class="series-toc__part-title">The Theory Behind The Canvas</span>
                </a>
            </li>
    </ol>
</aside>]]></content:encoded>
    </item>
    <item>
      <title>Building data quality into Microsoft Fabric</title>
      <description>Data quality issues are one of the biggest silent killers of analytics initiatives. This post explores how to build data quality into Microsoft Fabric from the ground up.</description>
      <link>https://endjin.com/blog/building-data-quality-into-microsoft-fabric</link>
      <guid isPermaLink="true">https://endjin.com/blog/building-data-quality-into-microsoft-fabric</guid>
      <pubDate>Wed, 15 Oct 2025 23:00:00 GMT</pubDate>
      <category>Microsoft Fabric</category>
      <category>Data Quality</category>
      <category>Data Governance</category>
      <category>Data Engineering</category>
      <category>Data Validation</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/09/building-data-quality-into-microsoft-fabric.png" />
      <dc:creator>Mike Evans-Larah</dc:creator>
      <content:encoded><![CDATA[<p>Data quality issues are one of the biggest silent killers of analytics initiatives. Teams invest significant time and resources into building dashboards and reports, only to discover their data pipeline was feeding them incorrect information all along.</p>
<h2 id="the-hidden-cost-of-poor-data-quality">The hidden cost of poor data quality</h2>
<p>We’ve seen this play out in many organizations: ETL jobs run without errors, reports refresh on schedule, and stakeholders confidently act on the data in front of them - until someone takes a closer look and uncovers a critical flaw in the underlying data.</p>
<p>The impact goes beyond just incorrect numbers. Poor data quality erodes trust in your analytics platform, creates expensive firefighting exercises, and can lead to costly business decisions based on faulty information.</p>
<p>We've learned that traditional approaches to data quality - where you build first and validate later - simply don't work. By the time you discover quality issues, they've already propagated through your entire analytics ecosystem.</p>
<h2 id="a-validation-first-approach">A validation-first approach</h2>
<p>The solution isn't just better testing (although testing forms an important part of the story too - see James Broome's talk on <a href="https://endjin.com/what-we-think/talks/how-to-ensure-quality-and-avoid-inaccuracies-in-your-data-insights">How to ensure quality and avoid inaccuracies in your data insights</a>) - it's rethinking how to approach data quality from the ground up. We use a "validation-first" mindset, where data quality checks become first-class citizens in the pipeline design.</p>
<p>Here are four strategic principles we follow when building data quality into Fabric implementations:</p>
<h2 id="principle-1-validate-early-and-often-in-your-pipeline-design">Principle 1: Validate early and often in your pipeline design</h2>
<p>The earlier you catch data quality issues, the cheaper they are to fix. We weave validation checks throughout the entire data pipeline.</p>
<p>In Fabric, this means building validation logic directly into your data engineering pipelines using notebooks and dataflows. Instead of waiting until data reaches your lakehouse, we validate at multiple checkpoints:</p>
<ul>
<li><strong>Source validation</strong>: Check data quality as it enters Fabric from external systems</li>
<li><strong>Transformation validation</strong>: Verify data integrity after each major transformation step</li>
<li><strong>Business rule validation</strong>: Ensure data meets your organization's specific requirements</li>
<li><strong>Output validation</strong>: Final checks before data reaches consumption layers</li>
</ul>
<p>This multi-layered approach aligns well with <a href="https://endjin.com/what-we-think/talks/microsoft-fabric-lakehouse-and-medallion-architecture">medallion architecture</a> principles, where data quality is maintained at each stage of the data lifecycle. For a deep-dive of how this can look in practice, see <a href="https://endjin.com/blog/creating-quality-gates-in-the-medallion-architecture-with-pandera">Creating Quality Gates in the Medallion Architecture with Pandera</a>.</p>
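<p>As a rough illustration, a quality gate at one of these checkpoints might look something like the following Pandera sketch. The table, columns and checks are hypothetical - treat this as a shape to adapt rather than a definitive implementation.</p>
<pre><code class="language-python">import pandera as pa
from pandera import Column, Check

# Hypothetical "data contract" for an orders table as it lands from the source system.
orders_schema = pa.DataFrameSchema(
    {
        "order_id": Column(str, nullable=False, unique=True),
        "order_date": Column("datetime64[ns]", nullable=False),
        "quantity": Column(int, Check.greater_than(0)),
        "unit_price": Column(float, Check.greater_than_or_equal_to(0.0)),
    },
    strict=False,  # extra columns are tolerated, but the ones above must be valid
)

def source_validation(orders_df):
    """Quality gate applied as data enters Fabric, before it reaches the lakehouse."""
    # Raises a SchemaError if the contract is broken, stopping the pipeline
    # before bad data propagates into downstream layers.
    return orders_schema.validate(orders_df)
</code></pre>
<p>The same pattern can be repeated after each transformation step and before the final output, with a schema per checkpoint.</p>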
<h2 id="principle-2-build-rich-actionable-error-reporting-from-the-start">Principle 2: Build rich, actionable error reporting from the start</h2>
<p>Generic error messages like "data validation failed" are useless when an urgent issue arises in production. We design our validation systems to provide rich, contextual information that helps teams quickly identify and resolve problems.</p>
<p>This means building validation reports that are user-friendly and include details like: specific error descriptions, which records are problematic and why, and suggested next steps for resolution.</p>
<p>These can be shared in the form of HTML-based emails, Power BI reports, or custom web applications, depending on the needs of the team.</p>
<div class="aside"><p><strong>NOTE</strong>: Be wary to ensure that any sensitive information is appropriately redacted when sharing validation reports.</p>
</div>
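<p>One way to produce this kind of contextual report with Pandera is to validate lazily and surface the collected failure cases. The column selection below is an illustrative assumption, not a prescribed approach.</p>
<pre><code class="language-python">import pandera as pa

def validate_with_report(df, schema):
    """Run all checks and return (validated_df, report_df) instead of
    stopping at the first failure with a generic error message."""
    try:
        return schema.validate(df, lazy=True), None
    except pa.errors.SchemaErrors as err:
        # failure_cases lists the offending column, the check that failed,
        # the failing value and the row index - enough detail to render into
        # an HTML email, Power BI report or custom web application.
        report = err.failure_cases[["column", "check", "failure_case", "index"]]
        return None, report
</code></pre>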
<h2 id="principle-3-create-feedback-loops-that-teams-actually-use">Principle 3: Create feedback loops that teams actually use</h2>
<p>The best validation system in the world is worthless if teams ignore the alerts. We design feedback systems that integrate naturally into how teams already work.</p>
<p>Alerts should arrive in the right channel at the right time (and be directed at the right people). Combined with rich, contextual information in the validation reports, this ensures that teams have everything they need to take action.</p>
<p>Fabric's integration with <a href="https://learn.microsoft.com/en-us/fabric/data-factory/teams-activity">Microsoft Teams</a> and <a href="https://learn.microsoft.com/en-us/fabric/data-factory/outlook-activity">email systems</a> makes it easy to create notification workflows that fit into existing team communication patterns.</p>
<h2 id="principle-4-integrate-quality-metrics-into-your-monitoring-strategy">Principle 4: Integrate quality metrics into your monitoring strategy</h2>
<p>Data quality shouldn't be an afterthought in your monitoring approach - it should be a core metric alongside performance and availability. Validation results should be captured and analyzed just like any other operational metric.</p>
<p>Fabric's <a href="https://learn.microsoft.com/en-us/fabric/fundamentals/workspace-monitoring-overview">built-in workspace monitoring capabilities</a>, combined with Power BI dashboards, enable teams to track data quality metrics alongside other operational KPIs and visualize trends over time.</p>
<p>Steps can be taken to automate the collection and reporting of these metrics, ensuring that data quality remains a top priority.</p>
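<p>As a simple example of automating this, a notebook step could append a summary row to a lakehouse table after each run, which Power BI can then trend over time. The table and column names here are assumptions made for the sake of the sketch.</p>
<pre><code class="language-python">from datetime import datetime, timezone

def log_quality_metrics(spark, pipeline_name, rows_checked, rows_failed):
    """Append one row of quality metrics per pipeline run to a lakehouse table."""
    metrics = [(
        pipeline_name,
        datetime.now(timezone.utc).isoformat(),
        rows_checked,
        rows_failed,
        rows_failed / rows_checked if rows_checked else 0.0,
    )]
    columns = ["pipeline", "run_at", "rows_checked", "rows_failed", "failure_rate"]
    (
        spark.createDataFrame(metrics, columns)
        .write.mode("append")
        .saveAsTable("data_quality_metrics")  # hypothetical monitoring table
    )
</code></pre>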
<h2 id="the-results">The results</h2>
<p>By implementing these principles in Fabric environments, you can expect:</p>
<ul>
<li><strong>Faster issue resolution</strong>: Teams can identify and fix data problems in minutes rather than hours or days</li>
<li><strong>Increased stakeholder confidence</strong>: Business users trust the data because they see consistent quality</li>
<li><strong>Reduced firefighting</strong>: Fewer emergency meetings about "why the numbers are wrong"</li>
</ul>
<p>Perhaps most importantly, teams can feel more confident about their analytics outputs. When you know your data quality processes are robust, you can focus on deriving insights rather than constantly second-guessing your numbers.</p>]]></content:encoded>
    </item>
    <item>
      <title>Top Features of Notebooks in Microsoft Fabric</title>
      <description>Lakehouse integration, built-in notebook resources, and collaboration features that set Microsoft Fabric notebooks apart from Jupyter and Databricks.</description>
      <link>https://endjin.com/blog/top-features-of-notebooks-in-microsoft-fabric</link>
      <guid isPermaLink="true">https://endjin.com/blog/top-features-of-notebooks-in-microsoft-fabric</guid>
      <pubDate>Wed, 15 Oct 2025 06:30:00 GMT</pubDate>
      <category>Microsoft Fabric</category>
      <category>Notebooks</category>
      <category>OneLake</category>
      <category>Delta Lake</category>
      <category>Spark</category>
      <category>Lakehouse</category>
      <category>DuckDB</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/09/top-features-of-notebooks-in-microsoft-fabric.png" />
      <dc:creator>Jessica Hill</dc:creator>
      <content:encoded><![CDATA[<p>The notebook experience in Microsoft Fabric is similar to notebook experiences on other platforms - they provide an interactive and collaborative environment where you can combine code, output, and documentation for data exploration and processing. However, there are a number of key features that set Fabric notebooks apart. I'll walk through the top features of Fabric notebooks in this blog post.</p>
<h2 id="native-integration-with-lakehouses">Native Integration with Lakehouses</h2>
<p>Fabric notebooks are natively integrated with your lakehouses in Fabric. You can mount a new or existing lakehouse directly into your Fabric notebook simply by using the 'Lakehouse Explorer' in the notebook interface. The Lakehouse Explorer automatically detects all of the tables and files stored within your lakehouse which you can then browse and load directly into your notebook. This direct integration with your lakehouses eliminates any need for manual paths / set-up, making it simple and intuitive to explore your lakehouse data from your Fabric notebook.</p>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/09/lakehouseexplorer.png" alt="Lakehouse Explorer in Fabric Notebooks." srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2025/09/lakehouseexplorer.png 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2025/09/lakehouseexplorer.png 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2025/09/lakehouseexplorer.png 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2025/09/lakehouseexplorer.png 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<h2 id="built-in-file-system-with-notebook-resources">Built-in File System with Notebook Resources</h2>
<p>Fabric notebooks also come with a built-in file system called notebook 'Resources', allowing you to store small files like code modules, CSVs and images. The notebook 'Resources Explorer' acts like a local file system within the notebook environment; you can manage folders and files here just like you would on your local machine. Within your notebook, you can then read from or write to the built-in file system. The files stored in the file system are tied to the notebook itself, and are separate from OneLake. This is useful when you want to store files temporarily to perform quick experiments or ad hoc analysis of data / scripts, or when you simply want to store notebook-specific assets.</p>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/09/resourcesexplorer.png" alt="Notebook Resources in Fabric Notebooks." srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2025/09/resourcesexplorer.png 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2025/09/resourcesexplorer.png 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2025/09/resourcesexplorer.png 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2025/09/resourcesexplorer.png 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<h2 id="drag-and-drop-data-exploration-with-data-wrangler">Drag-and-Drop Data Exploration with Data Wrangler</h2>
<p>Fabric notebooks also have a built-in feature called the 'Data Wrangler' which allows you to drag and drop files from your lakehouse / built-in file system into your notebook and load the data, all without writing any code. After dropping the file into the notebook, the Data Wrangler autogenerates the code needed to query and load the data. This low-code experience simplifies data loading and lowers the barrier to entry for data exploration. You don't need any coding experience to load your data into your Fabric notebook.</p>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/09/datawrangler.gif" alt="Drag and Drop with Data Wrangler in Fabric Notebooks." srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2025/09/datawrangler.gif 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2025/09/datawrangler.gif 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2025/09/datawrangler.gif 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2025/09/datawrangler.gif 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<h2 id="ai-assistance-with-copilot">AI Assistance with Copilot</h2>
<p>Copilot for Data Science and Data Engineering (Preview) is an AI assistant within Fabric notebooks that helps you to analyse and visualise your data. You can ask Copilot to provide insights on your data, generate code for data transformations or to build visualisations.</p>
<p>In Fabric notebooks, you can access Copilot by using the Copilot chat panel. Here you can ask questions like "Show me the top 10 products by sales", "Show me a bar chart of sales by product" or "Generate code to remove duplicates from this dataframe". Copilot will respond with either natural language explanations or will generate the relevant code snippets that you can copy and paste into your notebook to execute. You can also ask the Copilot chat to provide natural language explanations of notebook cells, and add markdown comments, helping you to understand and document your code. This makes data exploration more accessible, especially for those with a lower level of coding knowledge.</p>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/09/copilotchat.png" alt="Copilot chat panel in Fabric Notebooks." srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2025/09/copilotchat.png 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2025/09/copilotchat.png 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2025/09/copilotchat.png 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2025/09/copilotchat.png 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<p>Alongside the Copilot chat, you can also interact with Copilot directly within your notebook cells by using the Copilot in-cell panel. Here you can make requests to Copilot and it will provide the necessary code snippet directly in the cell below.</p>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/09/copilotpanel.gif" alt="Copilot in cell panel in Fabric Notebooks." srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2025/09/copilotpanel.gif 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2025/09/copilotpanel.gif 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2025/09/copilotpanel.gif 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2025/09/copilotpanel.gif 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<p>Also built into Fabric notebooks is Copilot's AI-driven inline code completion. Copilot generates code suggestions as you type based on your notebook's context using a model trained on millions of lines of code. This feature minimises syntax errors and helps you to write code more efficiently in your notebooks, accelerating notebook development.</p>
<p>You can also use Copilot to add comments, fix errors, or debug your Fabric notebook code by using <a href="https://learn.microsoft.com/en-us/fabric/data-engineering/copilot-notebooks-chat-magics">Copilot's chat-magics</a>. These are a set of IPython magic commands, that help you to interact with Copilot. For example, placing the <code>%%add_comments</code> magic command above a cell prompts Copilot to annotate the code with a meaningful explanation. Similarly, the <code>%%fix_errors</code> command analyses code and suggests corrections inline.</p>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/09/copilotchatmagics.gif" alt="Copilot chat magics." srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2025/09/copilotchatmagics.gif 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2025/09/copilotchatmagics.gif 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2025/09/copilotchatmagics.gif 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2025/09/copilotchatmagics.gif 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<p>Having spent time working with Copilot in Fabric notebooks, I've found the main advantage is that it saves time. Even if the output needs tweaking, it saves time and effort upfront by doing the bulk of the ground work. This is especially true for tasks that don't require deep contextual understanding or complex decision-making, for example reading/writing data, creating schemas from dataframes, renaming columns, basic transformations etc. Or for tasks where there is already a pattern in place within your code base, as you can ask Copilot to base its output on that, and it's generally accurate. I've also found that it's good at debugging your code and can spot things that are not always obvious, and it's pretty good at generating documentation too, all of which also saves time and effort.</p>
<p>However, Copilot doesn't always fully understand the context or intent behind your work. This is especially true for more complex tasks. Or, sometimes it might suggest code that is unoptimised. This is why you can't be fully reliant on Copilot's suggestions; you still need to review and refine what it generates. That said, even if the output isn't exactly what you need, it is often on the right track and can give you inspiration to get started. On the whole it's useful for getting unblocked and speeding up routine tasks, but it should be used as a tool to assist you, whilst you stay in control of the decision making behind your code. It is also worth noting that most of the Copilot features in Fabric notebooks described above are currently in Preview.</p>
<h2 id="faster-spark-start-up">Faster Spark Start-Up</h2>
<p>With the Spark-based Fabric notebooks, it is generally very quick to spin up a Spark session. If you have ever used notebooks in Azure Synapse before, you will know it takes a few minutes to spin up a Spark session. However, with Fabric notebooks, it takes a matter of seconds. The fast start-up times for Spark sessions are due to Fabric's starter pool model, which keeps a lightweight Spark runtime ready to serve new sessions. This means when you initiate a Spark job, it can attach your session to an already running pool and it doesn't need to provision a full cluster from scratch.</p>
<p>If you're running Spark sessions anywhere else in your tenant, the Spark runtime should start very quickly. This is because Fabric re-uses active sessions across the tenant. If any Spark session is already active within your tenant, your notebook can essentially piggyback on that runtime, allowing it to start in seconds. However, it is worth noting that if it's the first time you've run a Spark job in a while, it will take slightly longer to spin up a Spark session.</p>
<h2 id="python-notebooks">Python Notebooks</h2>
<p>Python notebooks in Fabric offer a pure Python coding environment without Spark. They run on a single-node cluster (2 vCores / 16 GB RAM) making them a cost-effective tool for processing small to medium sized datasets where distributed computing is not required. Using the Apache Spark engine for small datasets can get expensive and is often overkill. Depending on your workload size and complexity, Python Notebooks in Fabric may be a more cost-efficient option than using the Spark-based notebook experience in Fabric.</p>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/09/pythonnotebooks.png" alt="Python Notebooks in Fabric." srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2025/09/pythonnotebooks.png 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2025/09/pythonnotebooks.png 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2025/09/pythonnotebooks.png 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2025/09/pythonnotebooks.png 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<p>The Python notebook runtime comes pre-installed with libraries like <a href="https://delta-io.github.io/delta-rs/">delta-rs</a> and <a href="https://duckdb.org/">DuckDB</a> (See <a href="https://endjin.com/who-we-are/our-people/barry-smart/">Barry Smart's</a> 3-part <a href="https://endjin.com/blog/duckdb-rise-of-in-process-analytics-understanding-data-singularity">series on DuckDB</a>) for reading and writing Delta Lake data, as well as the Polars and Pandas libraries for fast data manipulation and analysis. This environment is ideal for those who want to leverage these libraries without additional setup. These libraries are not available by default in PySpark notebooks in Fabric, meaning that you would need to manually install and configure them to access similar functionality. For workflows that benefit from these specific libraries, Python notebooks offer a more ready-to-go experience.</p>
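<p>As a rough illustration of that ready-to-go experience, the sketch below reads a lakehouse Delta table from a Python notebook using the pre-installed libraries. The ABFS path is a placeholder, and depending on your environment you may also need to supply storage or authentication options.</p>
<pre><code class="language-python">import duckdb
import polars as pl
from deltalake import DeltaTable

# Placeholder path to a Delta table in OneLake - substitute your own.
table_path = "abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/MyLakehouse.Lakehouse/Tables/sales"

# Read the Delta table with delta-rs and hand it to Polars for fast analysis.
sales = pl.from_arrow(DeltaTable(table_path).to_pyarrow_table())
summary = sales.group_by("product").agg(pl.col("amount").sum().alias("total"))

# Or query the same DataFrame in-process with DuckDB.
top = duckdb.sql(
    "SELECT product, SUM(amount) AS total FROM sales GROUP BY product ORDER BY total DESC LIMIT 10"
).df()
</code></pre>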
<h2 id="integration-with-power-bi-semantic-models-through-semantic-link">Integration with Power BI Semantic Models through Semantic Link</h2>
<p>The final key feature of Fabric notebooks that this blog post is going to cover is their integration with Power BI semantic models through Semantic Link. Semantic Link is a feature in Fabric that connects Power BI semantic models with Fabric notebooks. It enables the propagation of semantic information - like relationships and hierarchies from Power BI into Fabric notebooks.</p>
<p>Fabric notebooks also have access to Semantic Link Labs, an open-source Python library built on top of Semantic Link that contains over 450 functions enabling you to programmatically manage semantic models and Power BI reports all from within Fabric notebooks. You can do things like rebinding reports to new models, detecting broken visuals, saving reports as <code>.pbip</code> files for version control or deploying semantic models across multiple workspaces with consistent governance.</p>
<p>Python notebooks in Fabric also offer support for the SemPy library. This is another Python library built on top of Semantic Link that enables you to interact with Power BI semantic models using pandas-like operations (although it's not actually pandas under the hood). SemPy introduces a custom object called <code>FabricDataFrame</code>, which behaves similarly to a pandas dataframe but is semantically enriched: it carries metadata from Power BI semantic models, like relationships, hierarchies, and column descriptions. SemPy supports operations like slicing, merging, and concatenating dataframes whilst preserving these semantic annotations, so you can explore and transform your data with semantic awareness maintained.</p>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/09/sempyconnectingtosemanticmodel.png" alt="Using the SemPy library in Fabric Notebooks to connect to Power BI semantic models." srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2025/09/sempyconnectingtosemanticmodel.png 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2025/09/sempyconnectingtosemanticmodel.png 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2025/09/sempyconnectingtosemanticmodel.png 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2025/09/sempyconnectingtosemanticmodel.png 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
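<p>As a rough sketch of what connecting to a model looks like in code (the model and table names below are hypothetical), you can list the semantic models in the workspace and read a table into a <code>FabricDataFrame</code>:</p>
<pre><code class="language-python"># A minimal sketch of connecting to a Power BI semantic model with SemPy.
# "Sales Model" and "Customers" are hypothetical names - substitute your own.
import sempy.fabric as fabric

# List the semantic models available in the current workspace
print(fabric.list_datasets())

# Read a table from the model into a semantically enriched FabricDataFrame
customers = fabric.read_table("Sales Model", "Customers")
print(customers.head())
</code></pre>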
<p>Another key feature of the SemPy library is the ability to retrieve and evaluate DAX measures from your Power BI semantic models. For example, you can use SemPy to retrieve and evaluate a measure like "Total Sales" from your semantic model, and you can also write new DAX expressions and evaluate them within your notebook.</p>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/09/sempyevaluatingdax.png" alt="Using the SemPy library in Fabric Notebooks to retrieve DAX measures and write DAX expressions." srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2025/09/sempyevaluatingdax.png 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2025/09/sempyevaluatingdax.png 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2025/09/sempyevaluatingdax.png 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2025/09/sempyevaluatingdax.png 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
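<p>To give a feel for this, here is a minimal sketch of evaluating an existing measure and a custom DAX query with SemPy. The model, measure, table, and column names are all hypothetical.</p>
<pre><code class="language-python"># A minimal sketch of evaluating DAX with SemPy - all names are hypothetical.
import sempy.fabric as fabric

# Evaluate an existing measure, grouped by a column from the model
total_sales = fabric.evaluate_measure(
    "Sales Model",
    measure="Total Sales",
    groupby_columns=["Customers[Country]"],
)

# Evaluate an arbitrary DAX query against the same model
result = fabric.evaluate_dax(
    "Sales Model",
    """
    EVALUATE
    SUMMARIZECOLUMNS(
        Customers[Country],
        "Average Order Value", AVERAGE(Sales[Amount])
    )
    """,
)

print(total_sales.head())
print(result.head())
</code></pre>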
<p>This means you can use business logic defined in Power BI, like calculated KPIs or aggregations, directly in your notebooks, without needing to reimplement the logic manually in Python. Reusing that logic reduces duplication and ensures consistency. It also promotes collaboration between data scientists working in Fabric notebooks and business analysts working in Power BI, as both are using a shared semantic layer.</p>
<p>Note that all notebook experiences in Fabric support Semantic Link, but only the Python notebook experience in Fabric offers support for the SemPy library.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Fabric notebooks offer a lot of great features. You can access your lakehouse data easily through the Lakehouse Explorer and store your notebook-specific assets in the built-in file system. The drag-and-drop experience with Data Wrangler makes data exploration accessible, and Copilot provides AI assistance for writing, documenting, and debugging code, as well as generating insights from your data. Spark sessions are quick to start thanks to Fabric's starter pool model, which makes getting started with distributed processing faster than on platforms like Azure Synapse.</p>
<p>Python notebooks provide a lightweight, cost-effective alternative for smaller workloads, and come pre-installed with libraries like Polars, DuckDB, and delta-rs - providing a ready-to-use environment for your analytics. Finally, the integration with Power BI semantic models through Semantic Link, Semantic Link Labs, and SemPy allows you to interact with semantic models programmatically, apply DAX measures directly in notebooks, and maintain semantic integrity across transformations. This shared semantic layer promotes collaboration between data scientists and business analysts, which ensures consistency and reduces duplication of logic across platforms.</p>
<p>Whilst these are the top features that stood out to me, there are also lots of other capabilities within Fabric notebooks, so do go and check them out yourself. My <a href="https://endjin.com/who-we-are/our-people/ed-freeman/">colleague Ed</a> has produced a great YouTube video series on getting started with Fabric notebooks, including <a href="https://www.youtube.com/watch?v=s_mHaLBlA94">Microsoft Fabric: Processing Bronze to Silver using Fabric Notebooks</a> and <a href="https://www.youtube.com/watch?v=UyS6ZUgh-Wc">Microsoft Fabric: Good Notebook Development Practices</a>.</p>]]></content:encoded>
    </item>
    <item>
      <title>How .NET 10.0 boosted JSON Schema performance by 18%</title>
      <description>Benchmarking Corvus.JsonSchema on .NET 10.0 showed an 18% free speed-up on top of .NET 8 and 9 gains — a cumulative 29% boost for LTS upgrades from .NET 8.</description>
      <link>https://endjin.com/blog/how-dotnet-10-boosted-json-schema-performance-by-18-percent</link>
      <guid isPermaLink="true">https://endjin.com/blog/how-dotnet-10-boosted-json-schema-performance-by-18-percent</guid>
      <pubDate>Mon, 13 Oct 2025 04:03:00 GMT</pubDate>
      <category>C# 14.0</category>
      <category>C# 14</category>
      <category>C#</category>
      <category>CSharp</category>
      <category>.NET 10.0</category>
      <category>dotnet</category>
      <category>C# Tutorials</category>
      <category>C# Programming</category>
      <category>High Performance</category>
      <category>low-allocation</category>
      <category>ref struct</category>
      <category>Span&lt;T&gt;</category>
      <category>ReadOnlySpan&lt;T&gt;</category>
      <category>JSON Schema</category>
      <category>Parser</category>
      <category>Code Generation</category>
      <category>JSON Validation</category>
      <category>JSON Serialization</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/10/how-dotnet-10-boosted-json-schema-performance-by-18-percent.png" />
      <dc:creator>Matthew Adams</dc:creator>
      <content:encoded><![CDATA[<p>At endjin, we maintain <a href="https://github.com/corvus-dotnet/Corvus.JsonSchema/">Corvus.JsonSchema</a>, an open source <a href="https://endjin.com/blog/json-schema-patterns-dotnet-data-object">high-performance library for serialization and validation of JSON using JSON Schema</a>.</p>
<p>Its first release was on .NET 7.0, and its performance was pretty impressive. <a href="https://endjin.com/who-we-are/our-people/ian-griffiths/">Ian Griffiths</a> has given a <a href="https://endjin.com/what-we-think/talks/high-performance-json-serialization-with-code-generation-on-csharp-11-and-dotnet-7-0">number of talks</a> on the techniques it uses to achieve its performance goals.</p>
<p>Since then, the .NET 8.0 runtime shipped, and with no code changes at all, we got a "free" performance boost of ~20%.</p>
<p>And then, when .NET 9.0 shipped, we got a further 32% performance boost, with our new Corvus.JsonSchema V4 codebase.</p>
<p>Now, with the .NET 10.0 release candidate available, we can see another significant boost: ~18% "for free" just for switching out the runtime.</p>
<p>If you have an LTS-based upgrade cycle and you are switching from .NET 8 to .NET 10, you'll see a pretty amazing 29% improvement.</p>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/assets/images/blog/2025/10/json-schema-performance.svg" alt="The Json Schema Performance Graph shows the benchmark results we will examine in the section below."></p>
<p>Here are the details.</p>
<h2 id="the-benchmark">The Benchmark</h2>
<p>We ran the benchmark on a 13th Gen Intel Core i7-13800H (2.90GHz, 1 CPU, 20 logical and 14 physical cores).</p>
<p>The .NET versions are as follows:</p>
<ul>
<li>.NET 8.0.20 (8.0.20, 8.0.2025.41914), X64 RyuJIT x86-64-v3</li>
<li>.NET 9.0.9 (9.0.9, 9.0.925.41916), X64 RyuJIT x86-64-v3</li>
<li>.NET 10.0.0 (10.0.0-rc.1.25451.107, 10.0.25.45207), X64 RyuJIT x86-64-v3</li>
</ul>
<p>The benchmark validates a large array consisting of 10,000 small JSON documents. Each document is typical of a small JSON payload in a web API: it includes some strings, some formatted strings (e.g. email, date), and some numeric values.</p>
<pre><code class="language-json">{
    "name": {
        "familyName": "Oldroyd",
        "givenName": "Michael",
        "otherNames": [],
        "email": "michael.oldryoyd@contoso.com"
    },
    "dateOfBirth": "1944-07-14",
    "netWorth": 1234567890.1234567891,
    "height": 1.8
}
</code></pre>
<table>
<thead>
<tr>
<th>Method</th>
<th>Runtime</th>
<th style="text-align: right;">Mean</th>
<th style="text-align: right;">Ratio</th>
<th style="text-align: right;">Allocated</th>
</tr>
</thead>
<tbody>
<tr>
<td>ValidateLargeArrayCorvusV4</td>
<td>.NET 8.0</td>
<td style="text-align: right;">9.804 ms</td>
<td style="text-align: right;">1.00</td>
<td style="text-align: right;">-</td>
</tr>
<tr>
<td>ValidateLargeArrayCorvusV4</td>
<td>.NET 9.0</td>
<td style="text-align: right;">8.447 ms</td>
<td style="text-align: right;">0.86</td>
<td style="text-align: right;">-</td>
</tr>
<tr>
<td>ValidateLargeArrayCorvusV4</td>
<td>.NET 10.0</td>
<td style="text-align: right;">6.913 ms</td>
<td style="text-align: right;">0.71</td>
<td style="text-align: right;">-</td>
</tr>
</tbody>
</table>
<h3 id="notes-on-7.0">Notes on 7.0</h3>
<p>.NET 7.0 is no longer supported, and we have dropped it from the benchmarks.</p>
<h2 id="is-this-the-end-of-the-free-lunch">Is this the end of the free lunch?</h2>
<p>As always, we ask ourselves whether we are seeing the end of the free lunch. Last year, we figured that we would still get improvements from .NET 10, and we were right. I don't think we've reached the end of the road yet.</p>
<h3 id="looking-at-the-future">Looking at the future</h3>
<p>We are currently working on vNext of Corvus.JsonSchema. It has some significant changes in the code generator, and produces code which offers 2-10x improvement in validation efficiency. This is likely to land in the .NET 10 timeframe - and will no doubt take advantage of optimizations available in .NET 11!</p>
<aside class="series-toc" aria-label="Series table of contents">
    <div class="series-toc__header">
        <h3 class="series-toc__title">Json Schema Performance</h3>
        <span class="series-toc__count">3 parts</span>
    </div>
    <ol class="series-toc__list">
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/how-dotnet-8-boosted-json-schema-performance-by-20-percent-for-free" class="series-toc__link">
                    <span class="series-toc__part-number">1.</span>
                    <span class="series-toc__part-title">.NET 8.0's 20% Boost</span>
                </a>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/how-dotnet-9-boosted-json-schema-performance-by-32-percent" class="series-toc__link">
                    <span class="series-toc__part-number">2.</span>
                    <span class="series-toc__part-title">.NET 9.0's 32% Boost</span>
                </a>
            </li>
            <li class="series-toc__item series-toc__item--current" aria-current="page">
                <span class="series-toc__part-number">3.</span>
                <span class="series-toc__part-title">.NET 10.0's 18% Boost</span>
                <span class="series-toc__current-label">(you are here)</span>
            </li>
    </ol>
</aside>]]></content:encoded>
    </item>
    <item>
      <title>Big Data London 2025</title>
      <description>AI agents dominated Big Data LDN 2025, but the real story wasn't the technology, it was which organisations could actually deploy it successfully. After five years tracking industry evolution through this event, one pattern emerged clearly: the winners had built their foundations first. For CTOs making platform decisions now, the strategic imperative isn't choosing between innovation and governance; it's recognizing that governance enables innovation at scale.</description>
      <link>https://endjin.com/blog/big-data-ldn-2025</link>
      <guid isPermaLink="true">https://endjin.com/blog/big-data-ldn-2025</guid>
      <pubDate>Thu, 09 Oct 2025 05:30:00 GMT</pubDate>
      <category>Data Mesh</category>
      <category>Data Product</category>
      <category>Data</category>
      <category>Analytics</category>
      <category>AI</category>
      <category>AI Agents</category>
      <category>Data Strategy</category>
      <category>Big Data London</category>
      <category>Microsoft Fabric</category>
      <category>MotherDuck</category>
      <category>DuckDB</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/10/big-data-ldn-2025.png" />
      <dc:creator>Barry Smart</dc:creator>
      <content:encoded><![CDATA[<p>TLDR; <a href="https://www.bigdataldn.com/">Big Data LDN</a> returned as the UK's largest free data and AI event, drawing 15,000+ professionals and 150+ exhibitors to Olympia in South West London. This year's event was dominated by AI agents.  Organizations demonstrating real AI success had done the foundational work first.  For senior leaders making platform decisions now, the message is clear: the strategic imperative isn't choosing between innovation and governance; it's recognizing that governance enables innovation at scale.</p>
<h2 id="the-industrys-evolution-five-years-in-focus">The Industry's Evolution: Five Years in Focus</h2>
<p>When I was a CTO, Big Data LDN was exactly the type of conference that I would make a point of attending.  With all of the major vendors represented at the event, it provides a rare opportunity to bring yourself up to date with the data and AI landscape.  CTOs and other senior decision makers continue to make important decisions about choice of platforms, architectures, methodology and skills for the future—this conference can really help to shape and validate those decisions.</p>
<p>One of the most fascinating aspects of attending Big Data LDN over the last 5 years is watching it serve as a barometer for where the industry's collective attention has shifted. Each annual edition captures a different phase in the evolution of the data landscape:</p>
<ul>
<li><strong>2021</strong>: the <a href="https://endjin.com/what-we-think/talks/what-is-a-data-lakehouse">Lakehouse</a> emerged as the solution for bridging data lakes and warehouses.  This platform, along with the <a href="https://endjin.com/what-we-think/talks/microsoft-fabric-lakehouse-and-medallion-architecture">Medallion Architecture</a>, became the foundation for unifying analytics and AI workloads.</li>
<li><strong>2022</strong>: <a href="https://endjin.com/what-we-think/talks/data-mesh-and-microsoft-fabric-a-perfect-fit">Data Mesh</a> dominated—how do we decentralize data ownership while maintaining coherence?  Data Products emerged as the organizing principle.  <a href="https://endjin.com/blog/big-data-ldn-highlights-and-how-to-survive-your-first-data-conference">Eli's blog provides deeper insights</a> into the 2022 event.</li>
<li><strong>2023</strong>: The ChatGPT revolution hit the conference hard.  The focus was very much on the use of Generative AI to spearhead innovation.</li>
<li><strong>2024</strong>: The focus shifted to how organizations could leverage GenAI while building the proper foundations. A realisation that AI is only as smart as the data it's fed.</li>
<li><strong>2025</strong>: AI Agents took center stage promising new use cases and accelerating the democratisation of data and analytics.</li>
</ul>
<p>The progression tells a story. We've moved up the stack from infrastructure debates (where should data live?) through architectural patterns (how should we organize it?) to operational models (how should we manage it?) and now to the intelligence layer (how should AI interact with it?).</p>
<p>But here's what's crucial: <strong>each layer builds on the previous ones</strong>. Organizations jumping straight to AI agents without solid infrastructure, governance, and product thinking will struggle.</p>
<div class="aside"><p>The success stories at this year's event all shared one characteristic: they had done the foundational work first.</p>
</div>
<h2 id="a-packed-event">A packed event</h2>
<p>Here are some photos to give you a sense of scale of the event.</p>
<p>The vast exhibition space houses 150+ vendor stands, each competing for attention. Multiple theatre spaces are curtained off but not soundproofed, creating a cacophony in which live demonstrations, theatre presentations, and hundreds of simultaneous conversations all blend together.
<img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/10/big-data-ldn-massive-hall.jpg" alt="Photo of one of the two halls filled with exhibitors and theatres." srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2025/10/big-data-ldn-massive-hall.jpg 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2025/10/big-data-ldn-massive-hall.jpg 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2025/10/big-data-ldn-massive-hall.jpg 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2025/10/big-data-ldn-massive-hall.jpg 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<p>There were parallel tracks in progress across 17 theatres.  It was sometimes difficult to choose which session to attend!  Some theatres used noise cancelling headphones to help cut through the noise!
<img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/10/lightning-talk-theatre.jpg" alt="Picture of Barry Smart with noise cancelling headphones on at a session." srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2025/10/lightning-talk-theatre.jpg 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2025/10/lightning-talk-theatre.jpg 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2025/10/lightning-talk-theatre.jpg 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2025/10/lightning-talk-theatre.jpg 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<p>Google's stand was very popular—you could design and print your own custom travel tag with the help of AI, and there was also a Lego builders competition with live commentary and the final winner judged by an AI:
<img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/10/google-lego-builders-ai-judge.jpg" alt="Picture of Google's stand and specifically 4 contestants ready to go head to head in the lego builders challenge judged by an AI" srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2025/10/google-lego-builders-ai-judge.jpg 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2025/10/google-lego-builders-ai-judge.jpg 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2025/10/google-lego-builders-ai-judge.jpg 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2025/10/google-lego-builders-ai-judge.jpg 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<p>Plan ahead to identify sessions you'd like to attend and exhibitors you'd like to visit. Popular sessions filled up 30 minutes early, so I pivoted to focus on the expo floor where I could control my experience. The expo floor offers interactions you can't replicate later, and the talks will be available on YouTube in 2-3 weeks. Subscribe to the <a href="https://www.youtube.com/@Bigdataldn">Big Data LDN YouTube channel</a> for notifications as these videos drop!
<img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/10/big-data-ldn-plan-your-sessions.jpg" alt="Picture of the paper programme (online version also available) with sessions marked" srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2025/10/big-data-ldn-plan-your-sessions.jpg 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2025/10/big-data-ldn-plan-your-sessions.jpg 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2025/10/big-data-ldn-plan-your-sessions.jpg 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2025/10/big-data-ldn-plan-your-sessions.jpg 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<h2 id="agents-everywhere">Agents everywhere!</h2>
<p>If there was one dominant theme threading through Big Data LDN 2025, it was <strong>AI agents</strong>. But we need to distinguish between two very different approaches that signal fundamentally different bets about the future:</p>
<h3 id="the-integration-play">The Integration Play</h3>
<p>Most established vendors are following what we might call the "AI augmentation" strategy—integrating generative AI capabilities into existing platforms. The core product remains familiar, but now it has AI-powered assistance for common tasks.  This makes perfect sense for companies with well established workflows that are fundamentally sound and just need AI assistance. Why force users to learn entirely new paradigms when you can make their current work more intelligent?</p>
<h3 id="the-agent-first-approach">The Agent-First Approach</h3>
<p>More intriguing were the handful of new entrants building products that are <strong>fundamentally agentic from the ground up</strong>. These aren't traditional BI tools with ChatGPT bolted on—they're platforms designed around the assumption that AI agents will be the primary interface between users and data.</p>
<p>Companies like <a href="https://www.getlynk.ai/">Lynk AI</a> exemplify this approach. Rather than asking "how do we add AI to our existing product?", they're asking "if we put AI agents at the heart of the data value chain, what should the architecture look like?" This agent-first approach assumes AI will fundamentally restructure how we interact with data.</p>
<p>Whether these emerging companies can disrupt the established mega vendors remains to be seen.  But it was exciting to see the innovation and the focus being placed onto new capabilities which may not have had the emphasis historically.</p>
<h2 id="semantic-layer-the-foundation-for-agentic-analytics">Semantic Layer: The Foundation for Agentic Analytics</h2>
<p>This brings us to what might be the most important architectural conversation happening right now: <strong>the semantic layer</strong>.</p>
<p>The semantic layer sits between your curated data and your AI.  It provides the context and meaning over that data, helping the AI navigate that knowledge successfully and arrive at the right answer.</p>
<p>Without a robust semantic layer, AI agents are essentially guessing about what your data means. It's like asking someone to translate a document when they don't speak the language it's written in—they might produce something, but it won't be trustworthy.</p>
<p>The semantic layer enables:</p>
<ul>
<li><strong>Consistent definitions</strong> - What does "customer" mean within and across different domains?</li>
<li><strong>Business context</strong> - How do metrics relate to business outcomes?</li>
<li><strong>Governed logic</strong> - Ensuring AI agents use approved calculation methods such as how to derive "gross margin".</li>
<li><strong>Explainable results</strong> - Tracing how answers were generated, with supporting lineage and citations back to ground truth.</li>
</ul>
<p>Companies like <a href="https://www.getlynk.ai/">Lynk AI</a> are positioning this as the critical infrastructure for trustworthy agentic analytics. In our work with enterprise clients struggling to implement AI on messy data, this emphasis is spot-on.</p>
<h2 id="vendors-to-watch-innovation-beyond-the-mega-platforms">Vendors to watch: Innovation Beyond the Mega Platforms</h2>
<p>Three of the "big four" mega vendors (<a href="https://www.databricks.com/">Databricks</a>, <a href="https://www.snowflake.com/en/">Snowflake</a>, <a href="https://cloud.google.com/">Google</a>) had large spaces and were extremely busy.  They had the space to run their own demos and presentations.</p>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/10/databricks-stand.jpg" alt="Picture of session under way at the Databricks stand" srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2025/10/databricks-stand.jpg 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2025/10/databricks-stand.jpg 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2025/10/databricks-stand.jpg 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2025/10/databricks-stand.jpg 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<p>I was surprised to find that Microsoft, the final of the "big four", had no presence in the exhibition hall. Numerous exhibitors that partner with Microsoft had a presence.  But Microsoft themselves? Absent.  This is puzzling.</p>
<p>Microsoft had recently held their own Fabric Conference (FabCon) in Vienna: <a href="https://endjin.com/blog/fabcon-vienna-2025-day-1">my colleagues Carmel and Jess attended the event and wrote up a great summary</a>.  It's great that FabCon exists, but this is an event for people who are already committed to Fabric: they're willing to pay for the conference and will get a lot out of it.  Whereas Big Data LDN allows CTOs and other senior decision makers a single (free!) event to assess the whole ecosystem—if you're not in the room, you're not part of the conversation.  Given the competitive intensity in the data platform space, with Snowflake, Google and Databricks all having significant presence and actively supporting their wider partner ecosystem, Microsoft's absence feels like a missed opportunity.</p>
<p>Away from the "big four" and their respective ecosystems, some of the most intriguing innovation was happening in the "Discovery Zone", where smaller niche vendors could be found. Based on conversations and demonstrations, here are three emerging vendors addressing real pain points with differentiated approaches:</p>
<h3 id="lynk-ai-semantic-layer-for-agentic-analytics"><strong>Lynk AI: Semantic Layer for Agentic Analytics</strong></h3>
<p>As discussed earlier, the semantic layer provides essential context for AI agents. <a href="https://www.getlynk.ai/">Lynk AI</a> is building a semantic layer platform specifically designed for AI agents, not as an afterthought to traditional BI. Their focus on creating a "semantic graph" that gives AI context about your data relationships addresses the fundamental trust problem in agentic analytics.</p>
<h3 id="motherduck-duckdb-for-collaboration-at-scale"><strong>MotherDuck: DuckDB for Collaboration at Scale</strong></h3>
<p><a href="https://duckdb.org/">DuckDB</a> has become the darling of the data engineering community for its speed and simplicity.  See my blog series on <a href="https://endjin.com/blog/duckdb-rise-of-in-process-analytics-understanding-data-singularity">DuckDB</a> and also on <a href="https://endjin.com/blog/introducing-ducklake-lakehouse-architecture-reimagined-modern-era">DuckLake</a> for more background about this exciting new analytics engine. <a href="https://motherduck.com/">MotherDuck</a> takes the open-source DuckDB engine and wraps it with the operational capabilities enterprises need: cloud-native scaling, team collaboration, and managed operations.  If you're exploring alternatives to traditional data warehouses for analytical workloads, this is worth evaluating.  The platform leverages DuckDB's cutting-edge in-process analytics engine to run blazingly fast queries at a TCO significantly lower than its competitors'.</p>
<h3 id="dlthub-python-native-data-loading"><strong>dltHub: Python-Native Data Loading</strong></h3>
<p><a href="https://dlthub.com/">dltHub</a> offers a Python library and platform for data loading that's built for the way modern data teams actually work. Instead of configuring low-code tools, you write Python. This resonates with the growing population of Python-first data engineers who find traditional ETL tools too rigid.  This focus on DevEx (developer experience) becomes increasingly important as organizations aim to accelerate time to value, avoid vendor lock-in and minimize technical debt.</p>
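<p>To illustrate the "just write Python" point, here is a minimal sketch loosely based on the dlt quickstart: loading a small list of records into a local DuckDB database. The pipeline, dataset, and table names are hypothetical.</p>
<pre><code class="language-python"># A minimal dlt sketch: load a couple of records into a local DuckDB database.
import dlt

orders = [
    {"order_id": 1, "customer": "Contoso", "amount": 250.0},
    {"order_id": 2, "customer": "Fabrikam", "amount": 99.5},
]

pipeline = dlt.pipeline(
    pipeline_name="orders_pipeline",   # hypothetical pipeline name
    destination="duckdb",
    dataset_name="raw_orders",
)

# dlt infers the schema, creates the destination table, and loads the rows
load_info = pipeline.run(orders, table_name="orders")
print(load_info)
</code></pre>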
<hr>
<p><strong>The Pattern</strong>: These three vendors exemplify a broader pattern emerging across the ecosystem—they're not trying to replace your entire platform. They're solving specific, high-value problems that mega vendors either overlook or address as afterthoughts. The strategic question for data leaders: when does the innovation and fit of a specialized vendor outweigh the simplicity of staying entirely within one mega platform's ecosystem?</p>
<p>Based on our experience, the answer is usually "hybrid"—use a mega platform for core capabilities, then integrate specialized vendors where they add meaningful value. The key is ensuring your chosen mega platform has strong commitment to open standards (Delta Lake, Iceberg, Arrow) that enable this flexibility.</p>
<h2 id="beyond-technology-the-socio-technical-reality">Beyond Technology: The Socio-Technical Reality</h2>
<p>One of the most encouraging observations from this year's event: growing recognition that data transformation is a <a href="https://endjin.com/blog/data-is-a-socio-technical-endeavour">socio-technical endeavour</a>. Success requires alignment of people, processes, and technology.</p>
<h3 id="people-skills-are-the-bottleneck">People: Skills Are the Bottleneck</h3>
<p>The prominence of training and education providers reflects the reality that skills, not tools or platforms, constrain progress.</p>
<p>Organizations face a critical choice: hire expensively for scarce external talent, or invest in developing internal capability. Our client engagements consistently demonstrate that building your own data talent yields better long-term results. These people already understand your business, culture, and challenges.</p>
<p>The number of career-switchers at the event reinforces this point: people capability is the real constraint, and treating it as a primary concern is an opportunity for organizations.</p>
<h3 id="processes-governance-as-enablement">Processes: Governance as Enablement</h3>
<p>The strong showing of governance vendors reflects painful lessons learned. The 25+ vendors in the Data Governance &amp; Quality category exist because enterprises have learned that data sprawl without governance creates expensive problems: compliance violations, incorrect decisions, duplicated effort, lost trust.</p>
<p>The most effective implementations position governance as enablement—guardrails that let people move faster safely, not gates that stop movement.</p>
<h3 id="technology-the-amplifier-not-the-solution">Technology: The Amplifier, Not the Solution</h3>
<p>With people and processes in place, technology becomes the amplifier rather than the solution. The most compelling vendor demonstrations weren't about technical capabilities; they were about how technology enables better collaboration, faster decision-making, and trusted outcomes.</p>
<p>Evaluate your transformation through all three dimensions simultaneously. If you're investing heavily in technology but not addressing skills development and governance processes, you're setting yourself up for expensive failures.</p>
<h2 id="regulatory-intelligence-governance-gets-strategic">Regulatory Intelligence: Governance Gets Strategic</h2>
<p>The collaborative panel featuring <a href="https://ico.org.uk/">ICO</a>, <a href="https://www.fca.org.uk/">FCA</a>, and <a href="https://www.ofcom.org.uk/">OFCOM</a>—three major UK regulators discussing their roles in an AI-enabled world—signals an important shift.</p>
<p>Rather than regulators reacting to problems after they occur, they're proactively positioning how to think about AI governance, data protection, and consumer safety. This matters because proactive regulatory alignment is becoming a competitive advantage. While competitors scramble with remediation, organizations that build with compliance from the start move faster and with greater confidence.</p>
<p>The regulators aren't trying to stop innovation—they're trying to ensure it happens responsibly.</p>
<p>Regulatory compliance is increasingly a competitive differentiator. Organizations in highly regulated sectors should prioritize platforms and practices that help them meet these requirements rather than creating additional compliance headaches. Factor regulatory positioning into your platform selection criteria now.</p>
<h2 id="three-strategic-imperatives-from-big-data-ldn-2025">Three Strategic Imperatives from Big Data LDN 2025</h2>
<ol>
<li><p><strong>Invest in Foundations Before Racing to Agents</strong> - The organizations demonstrating real AI success had established semantic layers, governance frameworks, and data product practices first. Don't skip generations of maturity—the excitement around AI agents is justified, but the differentiator will be those who deploy them on solid foundations.</p>
</li>
<li><p><strong>Solve for People and Process, Not Just Technology</strong> - Technology platforms are maturing rapidly, and open standards (Delta Lake, Iceberg) reduce vendor lock-in risks. But people capability remains the constraint. Focus on building internal data literacy and establishing governance as enablement, not bureaucracy.</p>
</li>
<li><p><strong>Focus on Patterns and Practices, Not Specific Tools</strong> - Many organizations are in a position to skip generations of technology, jumping from legacy systems to modern cloud-native architectures. But with limited hands-on experience, technology selection feels daunting. By focusing on patterns (data as a product) and practices (DataOps), you maintain flexibility to transition between platforms as needs evolve.</p>
</li>
</ol>
<h2 id="final-thought">Final Thought</h2>
<p><a href="https://www.bigdataldn.com/">Big Data LDN 2025</a> demonstrated that the industry continues to evolve at pace. The most valuable discussions weren't about which platform has the fastest query performance or the most connectors. They were about how organizations successfully navigate the socio-technical complexity of data transformation, how they build trustworthy AI on solid foundations, and how they develop the organizational capabilities that technology alone can never provide.</p>
<p>For senior leaders making strategic decisions now: the opportunity is significant, but so is the risk of rushing ahead without proper foundations. The winners will be those who recognize that AI agents represent the next chapter, not the entire story, and that the story requires careful authoring across people, process, and technology.</p>
<p><strong>Ready to assess your organization's readiness for agentic intelligence?</strong> Consider where you are across the three imperatives above. The organizations thriving in 2025 started this foundational work in 2022-2023. Don't wait for the next Big Data LDN to realize you're behind.</p>
<p>Note: Big Data LDN records and publishes theatre sessions to YouTube in the weeks following the event. If you missed sessions you prioritized, they'll be available for thoughtful consumption soon. Also consider smaller community-driven events like <a href="https://datascotland.org/">DATA:Scotland</a>, <a href="https://datasciencefestival.com/">Data Science Festival</a>, and <a href="https://sqlbits.com/">SQLbits</a> (on the Saturday) which offer free attendance with a less overwhelming scale.</p>]]></content:encoded>
    </item>
    <item>
      <title>FabCon Vienna 2025: Day 3</title>
      <description>FabCon is a conference dedicated to everything Microsoft Fabric. Day 3's sessions included migration, Databricks, Spark optimisation, and more.</description>
      <link>https://endjin.com/blog/fabcon-vienna-2025-day-3</link>
      <guid isPermaLink="true">https://endjin.com/blog/fabcon-vienna-2025-day-3</guid>
      <pubDate>Wed, 08 Oct 2025 05:15:00 GMT</pubDate>
      <category>Microsoft Fabric</category>
      <category>FabCon</category>
      <category>Data</category>
      <category>Data Factory</category>
      <category>OneLake</category>
      <category>Copilot</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/10/fabcon-vienna-2025-day-3.png" />
      <dc:creator>Carmel Eve</dc:creator>
      <content:encoded><![CDATA[<p>After a busy start to FabCon Vienna, day 3 continued with a focus on practical sessions and technical updates. The sessions covered migration from Azure Synapse, platform integration, and performance improvements.</p>
<h2 id="accelerating-fabric-migration-new-assistant-tools-for-data-engineering-and-warehousing">Accelerating Fabric Migration: New Assistant Tools for Data Engineering and Warehousing</h2>
<p>(<a href="https://www.linkedin.com/in/jenny-jiang-8b57036/">Jenny Jiang</a> - Microsoft, <a href="https://www.linkedin.com/in/charleswebb22/">Charles Webb</a> - Microsoft)</p>
<p>The morning kicked off with a session on Synapse to Fabric migration. The new Migration Assistant looks useful: it can migrate Spark pools, notebooks, job definitions, and lake databases. But it's worth keeping in mind that it doesn't touch pipelines, which is still going to be a bit of a headache for anyone with complex ETL flows!</p>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/10/fabric-migration.jpg" alt="Slide showing easy migration of assets." srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2025/10/fabric-migration.jpg 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2025/10/fabric-migration.jpg 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2025/10/fabric-migration.jpg 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2025/10/fabric-migration.jpg 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<h2 id="powering-ai-foundry-agents-with-azure-ai-search">Powering AI Foundry Agents with Azure AI Search</h2>
<p>(<a href="https://www.linkedin.com/in/farzadsunavala/">Farzad Sunavala</a> - Microsoft)</p>
<p>This session was a deep dive into Azure AI Foundry and how it integrates with Azure AI Search for RAG (Retrieval Augmented Generation) scenarios.</p>
<ul>
<li>Azure AI Foundry is best used for more complex business processes</li>
<li>Azure AI Foundry Agent Service - connects the core pieces of Azure Foundry into a single runtime, combining:
<ul>
<li>Azure AI Foundry Models (with most people gravitating towards OpenAI models)</li>
<li>Tools like Azure AI Search, Azure Machine Learning, and Azure AI Services.</li>
<li>Content filters and enterprise security</li>
<li>Observability through tracing modules and evaluation frameworks.</li>
</ul>
</li>
</ul>
<p>The session focused heavily on Azure AI Search and RAG patterns.</p>
<p>RAG allows you to ground LLM responses in your own data rather than relying solely on pre-trained knowledge. When a user asks a question, the data store is searched, and the results are combined with the original question to generate contextually relevant answers.</p>
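<p>Stripped of any particular vendor's API, the core of that step is simply assembling a grounded prompt from the retrieved results. Here is a minimal, provider-agnostic sketch (the actual search and LLM calls are out of scope and not shown):</p>
<pre><code class="language-python"># A minimal, provider-agnostic sketch of combining retrieved chunks with the
# user's question to form a grounded prompt. No real search or LLM API is used.
def build_grounded_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Assemble a prompt that grounds the answer in the retrieved context."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks))
    return (
        "Answer the question using only the sources below, citing them by number.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}\n"
    )

chunks = [
    "Fabric notebooks support both Spark and pure Python runtimes.",
    "Python notebooks run on a single-node cluster.",
]
print(build_grounded_prompt("Which runtimes do Fabric notebooks support?", chunks))
</code></pre>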
<p>RAG can be used to tackle harder questions from humans and agents, such as:</p>
<ul>
<li>Producing multi-stage queries</li>
<li>Achieving the aim of "natural language to SQL"</li>
</ul>
<p>They outlined the <strong>RAG framework pattern</strong> with three main components:</p>
<h3 id="data-pipeline">Data Pipeline</h3>
<p>This is probably the most critical step to get right - requiring clean, standardised data across multiple sources.</p>
<ul>
<li>Ingest - Multiple different data sources unified (can use OneLake shortcuts and mirroring).</li>
<li>Extract - Parse raw documents from tables, images, pdfs into a usable format.</li>
<li>Chunk - Split large documents into smaller segments to fit context windows (see the sketch after this list).</li>
<li>Embed - Convert chunks into vector embeddings.</li>
<li>Index - Store embeddings, processed data and enriched content for efficient querying.</li>
</ul>
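<p>As a rough illustration of the chunking step above, here is a minimal, framework-agnostic sketch that splits extracted text into overlapping fixed-size segments. The chunk size and overlap are arbitrary illustrative values, not a recommendation.</p>
<pre><code class="language-python"># A minimal sketch of fixed-size chunking with overlap, in plain Python.
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks with some overlap between them."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

document = "..."  # the extracted text of a parsed document
for i, chunk in enumerate(chunk_text(document)):
    print(f"chunk {i}: {len(chunk)} characters")
</code></pre>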
<h3 id="query-retrieval-pipeline">Query / Retrieval Pipeline</h3>
<p>The query/retrieval pipeline was broken down into several key steps:</p>
<ul>
<li><strong>Transform query</strong> – The LLM optimises the raw user input, making it more effective for search.</li>
<li><strong>Retrieve</strong> – The system searches for relevant information based on the optimised query. This stage includes:
<ul>
<li>Vector Search</li>
<li>Keyword Search - which is sometimes required when vector search is not sufficient.</li>
<li>And, Agentic Retrieval.</li>
</ul>
</li>
<li><strong>Re-ranking</strong> – Results are reordered using semantic ranking to ensure the most relevant items appear first.</li>
<li><strong>Generate response</strong> – The LLM uses the retrieved information and prompt to generate a contextual answer.</li>
</ul>
<p>The true strength of this pattern lies in its advanced retrieval pipelines, which include two key stages:</p>
<ul>
<li><strong>Query pre-processing</strong>: Involves query planning to optimise how searches are executed.</li>
<li><strong>Retrieval</strong>: Utilises vector search, keyword search, and agentic retrieval methods.</li>
</ul>
<div class="aside"><p><strong>Agentic Retrieval</strong>: Enabled by a new API within Azure AI Search, agentic retrieval applies techniques such as:</p>
<ul>
<li>Query planning</li>
<li>Fan-out query execution</li>
<li>Results merging</li>
<li>And, within a single LLM call, it can:
<ul>
<li>Use conversation history for added context</li>
<li>Correct spelling errors contextually</li>
<li>Break down complex queries as needed</li>
<li>Paraphrase queries for clarity</li>
<li>Rewrite queries using acronyms - you can do this by providing the LLM with a JSON file containing acronym definitions</li>
</ul>
</li>
</ul>
</div>
<h3 id="agent-orchestration">Agent Orchestration</h3>
<p>The agent orchestration layer brings together all components needed for end-to-end RAG pipelines, enabling seamless integration for GenAI deployments. It supports:</p>
<ul>
<li><strong>AI Models</strong>: Including multimodal models, embedding models, and enrichment models.</li>
<li><strong>Developer Frameworks</strong>: Integrations with GitHub, Copilot Studio, Azure AI Foundry, and open-source tools.</li>
</ul>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/10/rag-platform-integrations.png" alt="Slide showing platform integrations." srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2025/10/rag-platform-integrations.png 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2025/10/rag-platform-integrations.png 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2025/10/rag-platform-integrations.png 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2025/10/rag-platform-integrations.png 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<h3 id="storage-and-cost-optimisation-in-azure-ai-search">Storage and Cost Optimisation in Azure AI Search</h3>
<p>The session also covered techniques for optimising storage and reducing costs in Azure AI Search:</p>
<ul>
<li><strong>Scalar quantisation</strong>: Compresses vector data by lowering its resolution.</li>
<li><strong>Binary quantisation</strong>: Encodes each vector component as a single bit (1 or 0) - see the sketch after this list.</li>
<li><strong>Matryoshka Representation Learning (MRL)</strong>: Applies multi-level compression to vector embeddings.</li>
<li><strong>Variable dimensions</strong>: Automatically reduces storage requirements for vector indexes.</li>
<li><strong>Narrow types</strong>: Uses smaller data types (e.g., float16, int16, int8) for vector fields to minimise memory and disk usage.</li>
</ul>
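<p>To make the binary quantisation idea concrete, here is a minimal numpy sketch that reduces each vector component to a single bit. This illustrates the concept only; Azure AI Search applies its own quantisation internally, and the vector size below is just an example.</p>
<pre><code class="language-python"># A minimal numpy sketch of the idea behind binary quantisation: keep one bit
# per dimension (1 if positive, 0 otherwise), roughly a 32x saving vs float32.
import numpy as np

embedding = np.random.rand(1536).astype(np.float32) - 0.5  # example vector

bits = embedding > 0          # one boolean per dimension
packed = np.packbits(bits)    # pack 8 dimensions into each byte

print(f"original: {embedding.nbytes} bytes, quantised: {packed.nbytes} bytes")
</code></pre>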
<h3 id="demo-highlights">Demo Highlights</h3>
<p><strong>Azure AI Foundry &amp; OneLake Demo</strong><br>
A brief demo showcased how "top-k" chunks are used in RAG scenarios. The web app combined AI Foundry and AI Search to retrieve the top 3 most relevant matches for a user query. You are able to see the information that the prompt included in order to generate the search query, such as filters, semantic captions, and semantic ranking.</p>
<p><strong>Azure AI Foundry &amp; AI Search Demo</strong><br>
This demo walked through deploying a model (e.g., GPT-4.0) in Azure AI Foundry and creating a new agent. You can provide AI instructions, such as specifying access to Azure AI Search. The agent was connected to an Azure AI Search resource containing cardiology data from OneLake. The "hybrid and semantic search" configuration was recommended for domain-specific queries. When a user asked a question, the agent returned responses grounded in the AI Search data, and you could inspect the underlying search queries.</p>
<p><strong>Azure AI Search Portal (Agentic RAG Demo)</strong><br>
The final demo briefly showed how to create knowledge sources in the Azure AI Search portal. Within agentic retrieval, you can prioritise different knowledge sources for specific questions using the "Mode + Instructions" configuration.</p>
<h2 id="democratize-generating-business-insights-with-azure-databricks-through-ai-and-no-code">Democratize Generating Business Insights with Azure Databricks through AI and No-Code</h2>
<p>(<a href="https://www.linkedin.com/in/adamrwasserman/">Adam Wasserman</a> - Databricks, <a href="https://www.linkedin.com/in/isaacgritz/">Isaac Gritz</a> - Databricks)</p>
<p>Next up was a session on Azure Databricks. I imagined it would be a session about the integration between Azure Databricks and Fabric, but the session was more of a tour through Databricks' latest features.</p>
<p>There's a new <strong>business user portal</strong>, dashboards, and an AI assistant that lets you "chat with your data" - sound familiar? The "Unity Catalog business metrics" feature lets you centrally define things like revenue and save them in the catalog.</p>
<p><strong>Databricks Apps</strong> (GA) make it easy to build data apps over your data, which are "production ready" and built on serverless technology.</p>
<p><strong>Databricks One</strong> brings together all your data, apps, and dashboards for business users. This is great for data discovery, and it also supports integration with things like Teams.</p>
<p><strong>Lakeflows</strong> let you design no-code ETL (like dataflows). Alongside this GUI experience, you can also upload an image or table of your desired state, and AI will design it for you!</p>
<p>There is also support for designing no-code AI agents using "<strong>Agent Bricks</strong>".</p>
<p>Overall, it's clear that Databricks are moving in a direction similar to Fabric - with a focus on support for non-technical users and expanding out of the "ETL" space.</p>
<h2 id="scaling-and-protecting-data-engineering-in-fabric-best-practices-for-success">Scaling and Protecting Data Engineering in Fabric: Best Practices for Success</h2>
<p>(<a href="https://www.linkedin.com/in/thisissanthoshr/">Santhosh Kumar Ravindran</a> - Microsoft, <a href="https://www.linkedin.com/in/ashit-gosalia/">Ashit Gosalia</a> - Microsoft)</p>
<p>Back in Fabric-land, we attended a session on Spark optimisation at scale. This session ran through a load of "tips" for optimisation:</p>
<p>The new(ish) <strong>autoscale billing for Spark</strong> lets you scale compute independently of capacity. This is great as it means that you can start to use Fabric in a more serverless way - something we're big fans of here at endjin!</p>
<p>There are <strong>new controls</strong> to turn off starter pools (which might be bigger than you need), turn off job bursting, create custom pools at the capacity level, set max session lifetimes, and enable responsive scale-down (which means that executors are decommissioned when not active). Combined, these give you a lot of ways to limit excess usage.</p>
<p>They stressed the importance of enabling the <strong>native execution engine</strong> for massive performance improvements (at no extra cost). The engineering team has done loads of work on the native engine: faster write speeds, file size optimisation, auto file size optimisation, and alerts for when you fall back from the native engine. All these need to be toggled in pool or notebook settings.</p>
<p>There are a lot of additional features around private networking - for example, you can now <strong>block all outbound traffic</strong>. This means that, by default, you are unable to access anything external unless it's on a specific allow list.</p>
<p>They touched on the new <strong>custom live pools</strong>, which should go live in Q1 next year.</p>
<p>There are big performance improvements coming for uploading JAR files and python packages.</p>
<p>And, as mentioned previously in the <a href="https://endjin.com/blog/fabcon-vienna-2025-day-1">previous blog on day 1</a>, you can massively increase <strong>notebook concurrency</strong>.</p>
<p><strong>Row and column-level security</strong> in Spark are coming soon.</p>
<p>And, finally, there's a <strong>new JDBC driver</strong> for integration with external orchestrators and systems.</p>
<h2 id="microsoft-purview-data-security-protections-in-the-ai-era">Microsoft Purview Data Security Protections in the AI Era</h2>
<p>(<a href="https://www.linkedin.com/in/shilparanganathan/">Shilpa Ranganathan</a> - Microsoft, <a href="https://www.linkedin.com/in/anton-fritz-70979918/">Anton Fritz</a> - Microsoft)</p>
<p>The final session covered data security challenges in the age of AI. With the rise of AI-generated attacks and the risk of inadvertent data exposure through tools like Copilot, organisations need robust security frameworks.</p>
<p>Microsoft Purview provides a unified approach to data security and governance, offering estate-wide protection for Fabric (and beyond!). You can use Fabric's built-in security features, or add Purview-specific capabilities with an additional license.</p>
<p>Key <strong>Purview security features for Fabric</strong> include:</p>
<ul>
<li><strong>Information protection</strong> (GA) - Using the same sensitivity labels as Office, SharePoint, and Azure, with label-based access controls that follow data throughout Fabric and downstream to Office, PDF files and Power BI reports.</li>
<li><strong>Data loss prevention policies</strong> (GA) - Automatically scanning to detect sensitive data across semantic models, lakehouses, SQL, KQL, and mirrored databases. When violations are detected, you can take immediate remediation actions, including restricting access (Preview) to data owners only.</li>
<li><strong>Data risk assessments</strong> (Preview) - Discovering overshared data and identifying potential leakage risks. The assessment page shows metrics like unique user access patterns across Fabric workspaces.</li>
<li><strong>Insider risk management</strong> (GA) - Helps security teams identify potentially risky users and malicious activities, including monitoring Copilot prompts for Power BI.</li>
</ul>
<p>Here are some general ways to audit, monitor, and govern data using Microsoft Fabric and Purview:</p>
<h3 id="audit-monitoring">Audit &amp; Monitoring</h3>
<ul>
<li><strong>Fabric Admin Monitoring and Auditing</strong>: Fabric provides built-in admin monitoring tools to track activity, usage, and changes across your organisation’s data estate.</li>
<li><strong>Purview Audit</strong>: Microsoft Purview offers comprehensive audit capabilities, enabling you to monitor data access, usage, and policy compliance across multiple platforms.</li>
<li><strong>Integrated Auditing</strong>: Purview audit can be layered on top of Fabric’s native auditing, providing unified visibility and deeper insights into data operations and security events.</li>
</ul>
<h3 id="metadata-lineage">Metadata &amp; Lineage</h3>
<ul>
<li><strong>Fabric Lineage</strong>: Fabric’s lineage features allow you to trace data movement and transformations, helping you understand dependencies and the flow of data across workspaces.</li>
<li><strong>OneLake Catalog</strong>: The OneLake catalog centralises metadata management, making it easier to discover, classify, and govern data assets within Fabric.</li>
<li><strong>Purview Unified Catalog</strong>: Purview’s unified catalog extends metadata and lineage capabilities across your entire data estate, integrating with Fabric and other sources for holistic governance and compliance.</li>
</ul>
<p>The session emphasised that while Fabric has strong built-in security, Purview adds enterprise-grade governance across the entire data estate, which is increasingly critical as AI capabilities expand.</p>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/10/fabric-and-purview-security.png" alt="Slide showing Fabric and Purview interoperability." srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2025/10/fabric-and-purview-security.png 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2025/10/fabric-and-purview-security.png 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2025/10/fabric-and-purview-security.png 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2025/10/fabric-and-purview-security.png 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<h2 id="overall">Overall</h2>
<p>A packed few days, with a truly ridiculous number of announcements and features - many set to go live over the coming months. I'm looking forward to exploring a lot of these in more detail. And with that, I headed out to the Alps to make the most of a long weekend in Austria!</p>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/10/me-hiking-austria.jpg" alt="A selfie of me in the mountains in Austria." srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2025/10/me-hiking-austria.jpg 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2025/10/me-hiking-austria.jpg 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2025/10/me-hiking-austria.jpg 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2025/10/me-hiking-austria.jpg 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<p>Keep an eye out for upcoming articles and videos as we dig into the practicalities, share hands-on experiences, and see how these new features can be used in real-world projects!</p>
<div class="aside"><p>If you're interested in Microsoft Fabric, why not sign up to our new FREE <a href="https://fabricweekly.info/">Fabric Weekly Newsletter</a>? We also run <a href="https://azureweekly.info/">Azure Weekly</a> and <a href="https://powerbiweekly.info/">Power BI Weekly</a> Newsletters too!</p>
</div>
<aside class="series-toc" aria-label="Series table of contents">
    <div class="series-toc__header">
        <h3 class="series-toc__title">FabCon Vienna 2025</h3>
        <span class="series-toc__count">3 parts</span>
    </div>
    <ol class="series-toc__list">
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/fabcon-vienna-2025-day-1" class="series-toc__link">
                    <span class="series-toc__part-number">1.</span>
                    <span class="series-toc__part-title">Day 1</span>
                </a>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/fabcon-vienna-2025-day-2" class="series-toc__link">
                    <span class="series-toc__part-number">2.</span>
                    <span class="series-toc__part-title">Day 2</span>
                </a>
            </li>
            <li class="series-toc__item series-toc__item--current" aria-current="page">
                <span class="series-toc__part-number">3.</span>
                <span class="series-toc__part-title">Day 3</span>
                <span class="series-toc__current-label">(you are here)</span>
            </li>
    </ol>
</aside>]]></content:encoded>
    </item>
    <item>
      <title>FabCon Vienna 2025: Day 2</title>
      <description>FabCon is a conference dedicated to everything Microsoft Fabric. Day 2 featured deep dives into OneLake, Maps in Fabric, and multi-agent AI systems.</description>
      <link>https://endjin.com/blog/fabcon-vienna-2025-day-2</link>
      <guid isPermaLink="true">https://endjin.com/blog/fabcon-vienna-2025-day-2</guid>
      <pubDate>Tue, 07 Oct 2025 06:15:00 GMT</pubDate>
      <category>Microsoft Fabric</category>
      <category>FabCon</category>
      <category>Data</category>
      <category>OneLake</category>
      <category>AI Foundry</category>
      <category>Copilot</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/10/fabcon-vienna-2025-day-2.png" />
      <dc:creator>Carmel Eve</dc:creator>
      <content:encoded><![CDATA[<p>Day 2 at FabCon, we shifted from the raft of announcements, into some more deep-dive technical sessions. We covered OneLake capabilities, the new Maps functionality, and some great AI integration demos.</p>
<p>Not only this... But I picked up my new favourite item of clothing!</p>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/10/fabric-socks-2.jpg" alt="Photo of me holding Fabric socks" srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2025/10/fabric-socks-2.jpg 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2025/10/fabric-socks-2.jpg 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2025/10/fabric-socks-2.jpg 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2025/10/fabric-socks-2.jpg 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<h2 id="keynote-exploring-the-fabric-ecosystem-databases-security-and-ai-at-scale">Keynote - Exploring the Fabric Ecosystem: Databases, Security and AI at Scale</h2>
<p>(<a href="https://www.linkedin.com/in/shireeshthota/">Shireesh Thota</a> - Microsoft, <a href="https://www.linkedin.com/in/kimmanis/">Kim Manis</a> - Microsoft, <a href="https://www.linkedin.com/in/marcocasalaina/">Marco Casalaina</a> - Microsoft, <a href="https://www.linkedin.com/in/hawk-jessica/">Jessica Hawk</a> - Microsoft).</p>
<p>Day 2's Keynote was a far-reaching overview of the Fabric ecosystem as a whole.</p>
<p>They kicked off by showing how you can <strong>connect to Fabric items directly from VS Code</strong>, including deploying new SQL databases straight into your Fabric workspace.</p>
<p>They demonstrated "reverse-ETL" scenarios. These used Fabric to generate insights and then automate actions based on those insights. This is a powerful pattern for closing the loop between analytics and operations.</p>
<p>A significant chunk focused on AI, risk, and security. The <strong>new workspace-level security controls</strong> were interesting - using these, you can enact policies that are either mandatory or monitored, with alerting for issues like bad data uploads.</p>
<p>There was new functionality highlighted for <strong>Fabric Data Agents</strong>:</p>
<ul>
<li><strong>Support for mirrored sources</strong> - meaning that mirrored data is now incorporated in queries and responses</li>
<li><strong>Git integration</strong> (like all Fabric objects, agents are now supported in Git.)</li>
<li>You can also now <strong>invoke agents from outside Fabric</strong> - for example, from Azure Foundry.</li>
</ul>
<p>They showed a real-time translation demo of a Teams call, which I can imagine being very useful! <strong>Translation in real-time</strong> would be another step in what is (in my opinion) one of the most powerful uses of AI yet: allowing communication across boundaries like never before.</p>
<p>Then came a GitHub integration demo. This used the "<strong>GitHub Spec Kit</strong>" to write specifications, then got an agent to create branches, implement changes, and raise PRs. Quite impressive, though I'd be interested to see how it's able to handle more complex scenarios and specifications! For the moment, I'd certainly at least want a human to be reviewing the PRs raised...</p>
<h2 id="bringing-data-into-onelake-a-deep-dive-into-shortcuts-and-mirroring">Bringing Data into OneLake: A Deep Dive into Shortcuts and Mirroring</h2>
<p>(<a href="https://blog.fabric.microsoft.com/en-gb/blog/author/Trevor%20Olson">Trevor Olson</a> - Microsoft, <a href="https://www.linkedin.com/in/marakiketema/">Maraki Ketema</a> - Microsoft)</p>
<p>This session provided a thorough exploration of OneLake's data virtualisation capabilities.</p>
<h3 id="shortcuts">Shortcuts</h3>
<p>Shortcuts in OneLake are virtual pointers which allow you to reference data stored in external sources (such as Azure Data Lake Storage, Amazon S3, or other lakehouses), without physically copying or moving the data.</p>
<p>This enables seamless access and unified analytics across multiple storage systems, reducing data duplication and simplifying data management. Shortcuts are native items within OneLake, making it easy for users and services to query and process external data as if it were local, while maintaining a single source of truth and improving collaboration across teams. The shortcutted data behaves identically to native data (just with a different icon!).</p>
<p>There have been multiple enhancements for shortcuts in OneLake:</p>
<ul>
<li><strong>File cache</strong> (GA) with configurable retention and API reset capabilities. This is enabled in the OneLake settings.</li>
<li><strong>Cross-cloud identity federation</strong> with S3 (public preview)</li>
<li><strong>Enhanced network security</strong>, where you can support trusted service access and don't need to define the gateways yourself</li>
<li><strong>Metadata transformation</strong> (GA)</li>
<li><strong>Data transformation</strong> (preview) - where you can apply transforms (parquet to delta, JSON to delta, and AI transforms) as part of the shortcut</li>
<li>You can now <strong>edit shortcuts</strong> (GA) rather than recreating them</li>
<li><strong>Git integration</strong> with variable mapping and workspace variables</li>
<li><strong>Key Vault support</strong> (GA)</li>
<li><strong>Azure Blob Storage shortcuts</strong> (GA) including anonymous auth</li>
<li>Improved <strong>Dataverse authentication</strong> options (Service Principal, Workspace ID)</li>
</ul>
<h3 id="mirroring">Mirroring</h3>
<p>If you want to see both your external data in OneLake and the metadata and catalog information that describes it, then you can use mirroring.</p>
<p>In mirroring scenarios, all metadata associated with the items is "mirrored" into the OneLake catalog. The data is also then mirrored, in one of two ways:</p>
<ul>
<li>For open data formats, shortcuts are created to the data.</li>
<li>For proprietary data formats, a replica of the data is created within OneLake and kept in sync using change management. The compute used for this process is free, and the storage itself is also free up to a point.</li>
</ul>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/10/one-lake-connectors.jpg" alt="Slide showing OneLake connections" srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2025/10/one-lake-connectors.jpg 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2025/10/one-lake-connectors.jpg 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2025/10/one-lake-connectors.jpg 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2025/10/one-lake-connectors.jpg 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<p><strong>Open mirroring</strong> (preview) is a platform which allows you to extend mirroring to legacy systems and bespoke applications. Some partner companies have already built custom mirroring solutions on this platform.</p>
<p>The integration with <strong>Fabric Data Agents</strong> means you can pull mirrored data directly into agents and join tables from different sources. This is a powerful combination for cross-system analytics.</p>
<h2 id="geospatial-insights-for-everyone-with-maps-now-in-fabric">Geospatial Insights for Everyone with Maps now in Fabric</h2>
<p>(<a href="https://www.linkedin.com/in/johanneskebeck/">Johannes Kebeck</a> - Microsoft)</p>
<p>The next session we attended focused on the new Maps functionality in Fabric. It handles large datasets much better than Power BI maps and offers extensive customisation options. The integration with event streams enables real-time geospatial intelligence.</p>
<p>The technical implementation uses <strong>tilesets</strong> for performance. This creates vector tiles at different zoom levels, only including data that will actually show up on the available pixels. All layers are stored as a single compressed file.</p>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/10/vector-tiles.jpg" alt="Slide on why vector tiles are faster" srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2025/10/vector-tiles.jpg 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2025/10/vector-tiles.jpg 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2025/10/vector-tiles.jpg 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2025/10/vector-tiles.jpg 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<p>The demo connected to an event stream and grouped geometry data into <a href="https://h3geo.org/">H3 tiles</a> (a geo-spatial indexing system developed by Uber) using KQL. They also showed a demo of using Fabric Maps to track faults across a power grid in real-time, and mentioned applications such as precision farming and queue management.</p>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/10/h3-tiles.jpg" alt="Slide showing H3 tiles displayed on map." srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2025/10/h3-tiles.jpg 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2025/10/h3-tiles.jpg 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2025/10/h3-tiles.jpg 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2025/10/h3-tiles.jpg 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<p>While it's not GA yet, once production-ready this could be a powerful tool for anyone needing to perform geospatial analysis, especially in real-time.</p>
<h2 id="operationalize-fabric-cicd-python-library-to-work-with-microsoft-fabric-and-azure-devops">Operationalize Fabric-CICD Python Library to Work with Microsoft Fabric and Azure DevOps</h2>
<p>(<a href="https://www.linkedin.com/in/kevin-chant/">Kevin Chant</a> - Macaw)</p>
<p>This session ended up being more of an Azure DevOps overview than the Git integration deep dive I was hoping for.</p>
<p>Here at endjin, coming from software development backgrounds, we are used to being able to work concurrently on software, at scale. Personally, I'm still a little unclear how you would achieve this given the current Git functionality in Fabric.</p>
<p>The advice seems to be to have a workspace per branch, or per developer, which each developer attaches to a branch and makes changes in. But this approach doesn't seem like it would scale well across large development teams!</p>
<div class="aside"><p>Note for clarity - this isn't a criticism of the session, which was a good introduction to Azure Devops and the fabric-cicd python library - it just wasn't the session I had assumed it was (due to my misinterpretation of the session title)! The question above around branching policies and how to operationalisation of Fabric CI/CD in general did come up, and caught my attention due to it being one of my leading remaining questions around how teams should adopt and start to use Fabric in anger.</p>
</div>
<h2 id="integrating-microsoft-fabric-and-azure-ai-foundry-for-a-real-world-scenario">Integrating Microsoft Fabric and Azure AI Foundry for a Real-World Scenario</h2>
<p>(<a href="https://www.linkedin.com/in/m%C3%B2nica-calleja-3a856a185/">Monica Calleja</a> - Microsoft, <a href="https://www.linkedin.com/in/sara-lammini-rodriguez/">Sara Lammini Rodriguez</a> - Microsoft)</p>
<p>This was an excellent session on integration between Fabric and Azure Foundry. They walked through a real-world scenario tackling customer churn at a bank using a multi-agent system in Azure Foundry.</p>
<p>The <strong>multi-agent architecture</strong> was made up of:</p>
<ul>
<li><strong>Customer Info Retriever Agent</strong>: Used a Fabric Data Agent published as an endpoint.</li>
<li><strong>Customer Churn Prediction Agent</strong>: Connected to a machine learning model created in Fabric Data Science, trained on historical churn data and exposed via an endpoint.</li>
<li><strong>Loyalty Programmes Agent</strong>: Used Azure AI Search for RAG over PDF documents containing loyalty programme information, which was all vectorised and searchable.</li>
<li><strong>Orchestration Agent</strong>: Coordinated all three agents, determining which to call based on the question and parallelising calls to save tokens.</li>
</ul>
<p>The <strong>Agents Playground</strong> in AI Foundry provides a testing environment before deploying to production apps. You could see exactly which agents were called and how the orchestration worked.</p>
<p>The end result was a chat-bot interface where business users could ask: "Give me all info on Customer X. Are they likely to churn? If so, recommend a loyalty programme." The orchestration agent would call all relevant agents and synthesise a response.</p>
<p>This demo showcased the true potential of multi-agent systems. Specialised agents can work together to solve complex business problems that might otherwise be challenging - for humans, or for AI!</p>
<h2 id="overall">Overall</h2>
<p>Day 2 provided much more hands-on technical content compared to the announcement-heavy Day 1. The OneLake deep dives, Maps functionality, and especially the multi-agent AI demo showed in-depth practical applications of the Azure stack, all brought together by Microsoft Fabric.</p>
<p>The day finished with a beautiful sunset over Vienna. Stay tuned for day 3!</p>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/10/vienna-sunset.jpeg" alt="Photo of sunset over Vienna" srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2025/10/vienna-sunset.jpeg 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2025/10/vienna-sunset.jpeg 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2025/10/vienna-sunset.jpeg 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2025/10/vienna-sunset.jpeg 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<div class="aside"><p>If you're interested in Microsoft Fabric, why not sign up to our new FREE <a href="https://fabricweekly.info/">Fabric Weekly Newsletter</a>? We also run <a href="https://azureweekly.info/">Azure Weekly</a> and <a href="https://powerbiweekly.info/">Power BI Weekly</a> Newsletters too!</p>
</div>
<aside class="series-toc" aria-label="Series table of contents">
    <div class="series-toc__header">
        <h3 class="series-toc__title">FabCon Vienna 2025</h3>
        <span class="series-toc__count">3 parts</span>
    </div>
    <ol class="series-toc__list">
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/fabcon-vienna-2025-day-1" class="series-toc__link">
                    <span class="series-toc__part-number">1.</span>
                    <span class="series-toc__part-title">Day 1</span>
                </a>
            </li>
            <li class="series-toc__item series-toc__item--current" aria-current="page">
                <span class="series-toc__part-number">2.</span>
                <span class="series-toc__part-title">Day 2</span>
                <span class="series-toc__current-label">(you are here)</span>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/fabcon-vienna-2025-day-3" class="series-toc__link">
                    <span class="series-toc__part-number">3.</span>
                    <span class="series-toc__part-title">Day 3</span>
                </a>
            </li>
    </ol>
</aside>]]></content:encoded>
    </item>
    <item>
      <title>Rx.NET v6.1 New Feature: ResetExceptionDispatchState()</title>
      <description>&lt;p&gt;In this video, &lt;a href="https://endjin.com/who-we-are/our-people/ian-griffiths/"&gt;Ian Griffiths&lt;/a&gt; introduces the new &lt;code&gt;ResetExceptionDispatchState&lt;/code&gt; operator in &lt;a href="https://www.nuget.org/packages/System.Reactive/6.1.0"&gt;Rx.NET 6.1&lt;/a&gt; released in October 2025.&lt;/p&gt;
&lt;p&gt;He explains the peculiar behaviour of exception stack traces that led to the creation of this operator, following feedback from &lt;a href="https://github.com/adamjones2"&gt;Adam Jones&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The video delves into how exception state is managed in .NET and the specific issues that arise when exceptions are reused without being re-thrown. Ian demonstrates the problem with code examples and shows how the new operator resolves it.&lt;/p&gt;
&lt;p&gt;Full documentation is available at &lt;a href="https://introtorx.com/"&gt;Introduction to Rx.NET&lt;/a&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=_4BVPQit6EM&amp;amp;t=0s"&gt;00:00&lt;/a&gt; Introduction to Rx.NET's New Feature&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=_4BVPQit6EM&amp;amp;t=35s"&gt;00:35&lt;/a&gt; Background and Origin of the New Operator&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=_4BVPQit6EM&amp;amp;t=155s"&gt;02:35&lt;/a&gt; Understanding Exception State in .NET&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=_4BVPQit6EM&amp;amp;t=337s"&gt;05:37&lt;/a&gt; Demonstrating the Issue with Examples&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=_4BVPQit6EM&amp;amp;t=751s"&gt;12:31&lt;/a&gt; Introducing the &lt;code&gt;ResetExceptionDispatchState&lt;/code&gt; Operator&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=_4BVPQit6EM&amp;amp;t=918s"&gt;15:18&lt;/a&gt; Conclusion and Further Resources&lt;/li&gt;
&lt;/ul&gt;</description>
      <link>https://endjin.com/what-we-think/talks/rxdotnet-v6-1-new-feature-resetexceptiondispatchstate</link>
      <guid isPermaLink="true">https://endjin.com/what-we-think/talks/rxdotnet-v6-1-new-feature-resetexceptiondispatchstate</guid>
      <pubDate>Fri, 03 Oct 2025 16:33:00 GMT</pubDate>
      <category>Reactive Extensions</category>
      <category>dotnet</category>
      <category>Rx.NET</category>
      <category>NuGet</category>
      <category>Reactive Programming</category>
      <category>ReactiveX</category>
      <category>C#</category>
      <category>Open Source</category>
      <category>.NET Foundation</category>
      <category>Talk</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/talks/rx-dotnet-dispatch-state.png" />
      <dc:creator>Ian Griffiths</dc:creator>
      <content:encoded><![CDATA[<p>In this video, <a href="https://endjin.com/who-we-are/our-people/ian-griffiths/">Ian Griffiths</a> introduces the new <code>ResetExceptionDispatchState</code> operator in <a href="https://www.nuget.org/packages/System.Reactive/6.1.0">Rx.NET 6.1</a> released in October 2025.</p>
<p>He explains the peculiar behaviour of exception stack traces that led to the creation of this operator, following feedback from <a href="https://github.com/adamjones2">Adam Jones</a>.</p>
<p>The video delves into how exception state is managed in .NET and the specific issues that arise when exceptions are reused without being re-thrown. Ian demonstrates the problem with code examples and shows how the new operator resolves it.</p>
<p>Full documentation is available at <a href="https://introtorx.com/">Introduction to Rx.NET</a>.</p>
<ul>
<li><a href="https://www.youtube.com/watch?v=_4BVPQit6EM&amp;t=0s">00:00</a> Introduction to Rx.NET's New Feature</li>
<li><a href="https://www.youtube.com/watch?v=_4BVPQit6EM&amp;t=35s">00:35</a> Background and Origin of the New Operator</li>
<li><a href="https://www.youtube.com/watch?v=_4BVPQit6EM&amp;t=155s">02:35</a> Understanding Exception State in .NET</li>
<li><a href="https://www.youtube.com/watch?v=_4BVPQit6EM&amp;t=337s">05:37</a> Demonstrating the Issue with Examples</li>
<li><a href="https://www.youtube.com/watch?v=_4BVPQit6EM&amp;t=751s">12:31</a> Introducing the <code>ResetExceptionDispatchState</code> Operator</li>
<li><a href="https://www.youtube.com/watch?v=_4BVPQit6EM&amp;t=918s">15:18</a> Conclusion and Further Resources</li>
</ul>
<p><a href="https://www.youtube.com/watch?v=_4BVPQit6EM"><img src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/talks/rx-dotnet-dispatch-state.png"></a></p><p>I'm gonna talk about the new reset exception dispatch state operator that we've added to Rx.NET. So we released Rx version 6.1 in October 2025. And the reason that's a minor bump release is that we've added some new features, so there's <code>DisposeWith</code> and <code>TakeUntil(CancellationToken)</code>, and there are separate videos on both of them.</p>
<p>And today I'm gonna talk about the <code>ResetExceptionDispatchState</code> operator. Now if you've been following Rx's progress, you'll know we're also working on fixing some packaging issues, but that's not gonna come out until Rx v7.0.</p>
<p>What is the reset exception dispatch state operator about? So this originated from some feedback we had from Adam Jones. Adam initially reported a peculiar behavior, and then we discussed it and worked on some ways that we might be able to deal with this and eventually came up with the design for the <code>ResetExceptionDispatchState</code>, with his review and input. So, thank you very much to Adam!</p>
<p>So, the observed behavior that has led to this addition is that sometimes you can end up with some very odd looking stack traces in exceptions in Rx. This happens if you use Rx in a way where an exception that originates from an observable source ends up turning into an actual thrown .NET exception. So, for example, if you <code>await</code> an <code>IObservable</code> and that <code>IObservable</code> produces an error, then sometimes you can end up with strange repetitions in your stack traces.</p>
<p>So, if you look at this stack trace on the screen here, you can see it's basically got the same information three times over. And actually, at the point where this was thrown, the call stack only had that information one time over. This does not reflect reality, and this was the behavior that led to the initial bug report.</p>
<p>Now this only happens under certain circumstances, so you'll only see this behavior if you have a single exception object that gets reused without being thrown between those uses. So without there being an actual C# <code>throw</code> operation or an equivalent in a different language. And not only does that have to happen, but Rx also has to convert that error from the normal Rx <code>OnError</code> mechanism into an actual rethrow. So, this happens when you <code>await</code> an observable that ends up calling <code>OnError</code> on its observer, for example.</p>
<p>And what's actually happening here is that the exception state is not being reset. So what do I mean by exception state? When an exception is thrown, the .NET runtime captures certain contextual information in order to report information about where the error actually came from. So this includes the stack trace, but it also includes certain other information that is known collectively as the crash bucket or the fault bucket. And if, for example, you're using Windows Error Reporting, this enables errors to be distinguished from one another.</p>
<p>So, the basic idea behind this exception state is to be able to distinguish between different causes of exceptions, and application developers can sign their applications up to receive information about this if the user consents. The idea is that if lots of users of your application are having the same problem, you can find out about that and prioritize fixing that problem over any others.</p>
<p>So the idea is that when an exception is thrown, this fault bucket information is captured and attached to the exception. And if that exception eventually gets thrown back out of the application and causes it to crash completely, that crash bucket information can be recorded by Windows. So that's the exception state that I'm talking about and it includes a stack trace as well.</p>
<p>And by design, Rx does not reset that exception state when you give it an exception, because there might be important information there that you might actually want to record. So actually the behavior we're seeing is not technically a bug. You can get exactly the same weird multiple copies of the stack trace in .NET without using Rx at all; this is a .NET runtime behavior that occurs under certain circumstances. The reason Rx is not resetting this exception state is that we want to flow exception state correctly in certain other circumstances. Fundamentally, there are just some rules you have to follow if you don't wanna run into this behavior, but that was never previously obvious to developers using Rx.</p>
<p>So we've done a couple of things. We've updated the documentation on the <a href="https://introtorx.com/">https://introtorx.com</a> site to clarify the rules around this, and to clarify that the original problem that Adam Jones reported to us is actually expected behavior, because the code in question was not conforming to the rules. We've now made it clear what the rules are, so that people can know they should expect this. But even if you did understand the constraints, it wasn't always that easy to do the right thing, and so we've added this <code>ResetExceptionDispatchState</code> operator to make it easy to do the right thing.</p>
<p>So let's take a look at it in action. So I'm actually gonna start with an example that shows why it is that Rx doesn't just reset this exception state. So I'm here, I'm using the <code>Observable.Create()</code> method to build a new <code>IObservable</code> that actually executes by running some code. So what this does is it's gonna try and open a file, and then it's gonna read each line from the file in turn and then deliver that to whoever has subscribed to this observable. So, this is like an imperative way of writing an observable source.</p>
<p>And then down here I subscribe to that source and I attempt to pull out the first non-empty line from the file. So, the thing is that the file I'm trying to open doesn't actually exist on my system, so this line here is gonna fail. So, what do you think will happen when we try to essentially subscribe to this observable source down here? Let's run it and find out.</p>
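<p>The demo code itself isn't reproduced in this transcript, but a minimal sketch of the same shape looks something like the following (illustrative only - the line numbers mentioned below refer to the actual demo, not to this sketch):</p>
<pre><code>using System;
using System.IO;
using System.Reactive.Disposables;
using System.Reactive.Linq;

// An observable source that opens a file and emits each line in turn.
// The file doesn't exist, so StreamReader throws inside the source.
IObservable&lt;string&gt; lines = Observable.Create&lt;string&gt;(observer =&gt;
{
    using var reader = new StreamReader("does-not-exist.txt"); // throws FileNotFoundException
    while (reader.ReadLine() is string line)
    {
        observer.OnNext(line);
    }
    observer.OnCompleted();
    return Disposable.Empty;
});

// Awaiting the observable turns the OnError into a thrown .NET exception.
string firstNonEmpty = await lines.FirstAsync(l =&gt; l.Length &gt; 0);
Console.WriteLine(firstNonEmpty);
</code></pre>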
<p>So, you can see we get this <code>FileNotFoundException</code>, and you can see I've actually got a double stack trace. We've got one of these markers here that says, there's a stack trace here, then another stack trace, and this is typical of asynchronous exception handling in .NET. You get the original point at which the exception was thrown, and we can see that's happened on line 5 of my program and that is indeed the line that tried to open the file in the first place. So that was the original point at which the exception got thrown, but then it got thrown again at a different location because we are using <code>await</code>.</p>
<p>So, line 13 of my program, if I look at that, that's this one here. Here is where it got re-thrown. So, it originally got thrown here. Rx actually caught it and then when we did the <code>await</code> Rx re-threw it and we get the full stack trace information and as it happens, the crash bucket's also gonna identify this line here, all that information is present because Rx didn't reset it. And this is by design. We don't want to throw away this information.</p>
<p>So this example shows exactly why we are not discarding that information by default. But now I'm gonna show you a program where that causes a problem. Let me switch the debugger over to that one.</p>
<p>So here I'm using the <code>Throw</code> method to build an observable source that will instantly just report this exception to anyone who subscribes to it. So as soon as you subscribe, it's gonna call your observer's <code>OnError</code> straight back, passing in this exception. And then I've written this loop here that <code>await</code>s that observable each time round the loop and then prints out the exception.</p>
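<p>As a rough sketch, the shape of that demo is something like this (illustrative, not the exact code from the video):</p>
<pre><code>using System;
using System.Reactive.Linq;

// A single exception object, constructed but never actually thrown.
var exception = new InvalidOperationException("Something went wrong");

// Throw() reports that same exception via OnError to every subscriber.
IObservable&lt;long&gt; source = Observable.Throw&lt;long&gt;(exception);

for (int i = 0; i &lt; 10; i++)
{
    try
    {
        await source; // Rx rethrows the error here
    }
    catch (Exception caught)
    {
        // The reported stack trace grows by one copy on each iteration.
        Console.WriteLine(caught.StackTrace);
        Console.WriteLine("----");
    }
}
</code></pre>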
<p>Let's run this and see what happens. So, the first time round the loop, it catches the exception and reports the stack trace. We can see it's come from inside some of Reactive's implementation, but it's being reported as originating at line 12 of my code. Let's take a look at that. Line 12 is this one here, it is the <code>await</code>. So far so good.</p>
<p>But now I'm gonna let this run again and I just need to clear that selection. Okay, so now we've hit the exception handler again, but this time we've got two copies of the stack and that does not accurately reflect what happened. The stack's not getting any deeper here. This is just a reporting error. The exception stack trace has grown, and if I go around again this time, it's gonna be three high. And if I let this run all the way to the end, 10 times round the loop, then at the very last time, we got 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 copies of the stack trace.</p>
<p>So this is the problem that was originally reported to us, but as I said, it's not technically a bug. For one thing, this is nothing to do with Rx. We could actually take Rx completely out of the equation and just construct the exception directly. And then, instead of awaiting my observable source (which no longer exists in this version), I do an <code>await</code> of <code>Task.FromException</code> of that exception - so now I'm just using .NET's <code>Task</code> class and telling it to wrap this exception as a faulted task.</p>
<p>If I run this again and let's just breakpoint the same place. So the first time around we get just a single line exception. But if I run it around again, now we get two copies of the stack trace and we run it again. Now we get three copies and so on.</p>
<p>So this problem isn't unique to Rx. It happens anytime you cause an exception to be re-thrown, either through <code>await</code> or through some other mechanism that's able to re-throw a previously captured exception.</p>
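<p>A minimal Rx-free version of the same experiment might look like this (an illustrative sketch, not the exact demo code):</p>
<pre><code>using System;
using System.Threading.Tasks;

// The same effect with Rx removed entirely: construct an exception, never throw it,
// and let await repeatedly rethrow it from a faulted Task.
var exception = new InvalidOperationException("Something went wrong");

for (int i = 0; i &lt; 10; i++)
{
    try
    {
        await Task.FromException(exception); // rethrows the captured exception
    }
    catch (Exception caught)
    {
        Console.WriteLine(caught.StackTrace); // same accumulating stack trace
        Console.WriteLine("----");
    }
}
</code></pre>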
<p>Here you can see I've constructed my exception, but there's no <code>throw</code> operation anywhere in sight. If you re-throw an exception that was never technically thrown in the first place, then each time you re-throw it, .NET just appends a bit more information to the stack trace, because it's not really expecting you to do this.</p>
<p>The assumption when you re-throw is that you already have an exception that was thrown, and that you want to append a bit more information to the exception state: it will have the original stack trace in it, but now you are doing an <code>await</code>, so .NET adds in the stack trace for where you are calling this from, appending it to what was already there. And because nothing ever resets this exception state, it just goes round and round again, and just keeps getting longer and longer.</p>
<p>And so if I put this back how it was, go back to the Rx version, that is why this is happening. We're constructing a single exception object and we're repeatedly causing Rx to re-throw it, but there's no original <code>throw</code> event. You might think, shouldn't the <code>throw</code> operator do that for us? Shouldn't this reset the thing? Shouldn't this do what <code>throw</code> does in C#? The reason it doesn't is that Rx says, you might have attached important information in here, and there's no good way to discover when you've passed in an exception that doesn't have exception state attached, there isn't an efficient way for us to discover that. So we just have to assume it might have that state, and therefore we don't touch it.</p>
<p>So how do we fix this? The first thing to understand is what the rules are. So the first thing we've actually done is we have modified the documentation. This is a page from the Intro to Rx book, which is available online on the <a href="https://introtorx.com/">https://introtorx.com</a> website (and also in PDF form). In the chapter about "Leaving Rx's World" - the chapter that talks about things like coming out of the world of <code>IObservable</code> and into the world of <code>Task</code>, which is what we are doing if we <code>await</code> an observable - we've added a section at the start that talks about exception state and makes very clear what the rules are.</p>
<p>It says that if you are using a mechanism that takes exceptions delivered from an observable and re-throws them, then you as the developer are responsible for ensuring that either you don't reuse those exception objects (so each exception object is used just once), or you do something to reset the exception dispatch state.</p>
<p>So this is the first thing we've done. We've documented these rules. Clearly these were never explicitly documented before, so it's not surprising people don't know that this isn't technically allowed. And then the other thing we've done is we've said, if you are in this situation where you do want to throw the same exception multiple times, we'll give you this new operator called <code>ResetExceptionDispatchState</code>.</p>
<p>So this operator passes all of the events that it receives straight through. It doesn't change anything except for one thing: if the upstream source - the <code>Throw</code> in this case - happens to produce an exception, if it calls <code>OnError</code>, then before this passes that on, it resets the exception dispatch state. It does something equivalent to throwing the exception in order to reset the stack trace and all the other exception state.</p>
<p>So if we run this version, the first time around we see the stack trace pointing at line 13 of main, and you'd also get this inner one saying that the exception state was reset inside the <code>ResetExceptionDispatchState</code> operator. That's a side effect of us resetting the state inside that operator, so you'll always see this. But if we run it again - if I go around the loop a second time, throwing the same exception - this time we just see one copy of the same stack trace. We don't see the thing growing. So if I let this run all 10 times, even the 10th time, you just get a single copy of the stack trace.</p>
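<p>As a sketch, the fix is just a matter of inserting the operator into the pipeline (illustrative only, and assuming the operator surfaces through the usual <code>System.Reactive.Linq</code> extensions; the line numbers mentioned above refer to the actual demo):</p>
<pre><code>using System;
using System.Reactive.Linq;

var exception = new InvalidOperationException("Something went wrong");

// ResetExceptionDispatchState resets the captured dispatch state each time an
// error flows through this point, so the stack trace no longer accumulates.
// (Assumed to be available via the usual System.Reactive.Linq using directive.)
IObservable&lt;long&gt; source = Observable
    .Throw&lt;long&gt;(exception)
    .ResetExceptionDispatchState();

for (int i = 0; i &lt; 10; i++)
{
    try
    {
        await source;
    }
    catch (Exception caught)
    {
        Console.WriteLine(caught.StackTrace); // a single copy, every time round
        Console.WriteLine("----");
    }
}
</code></pre>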
<p>So to sum up, the basic problem here is that when you re-throw an exception in .NET without actually doing a real <code>throw</code>, you end up with these slightly strange results. The exception state accumulates, and that can cause odd looking stack traces. That's a basic .NET runtime behavior. We can't change this, and this is actually the expected behavior.</p>
<p>So what we've done is we've changed the documentation to make it clear that this is the situation. So people now have some way of knowing that they weren't supposed to do that in the first place. But we've also added this new operator <code>ResetExceptionDispatchState</code>, which you can add in to say, I don't want to keep the exception state. In fact, I need to not preserve the exception state. I need the exception state to be reset every single time an exception flows through this point in my subscription. And if you put that in, it will reset the exception each time the source produces it, avoiding the problem.</p>
<p>And if you go to <a href="https://introtorx.com/">https://introtorx.com</a>, you can see much more detail about exactly how this works.</p>
<p>My name's Ian Griffiths. Thanks very much for listening.</p>]]></content:encoded>
    </item>
    <item>
      <title>Rx.NET v6.1 New Feature: DisposeWith()</title>
      <description>&lt;p&gt;In this episode, &lt;a href="https://endjin.com/who-we-are/our-people/ian-griffiths/"&gt;Ian Griffiths&lt;/a&gt; introduces the new &lt;code&gt;DisposeWith&lt;/code&gt; method available in &lt;a href="https://www.nuget.org/packages/System.Reactive/6.1.0"&gt;Rx.NET version 6.1&lt;/a&gt;, released in October 2025.&lt;/p&gt;
&lt;p&gt;He discusses the new functionalities, including a new &lt;code&gt;TakeUntil()&lt;/code&gt; overload and the &lt;code&gt;ResetExceptionDispatchState&lt;/code&gt; operator, with additional videos to cover these features.&lt;/p&gt;
&lt;p&gt;Ian explains that &lt;code&gt;DisposeWith&lt;/code&gt; is a community contribution by &lt;a href="https://github.com/ChrisPulman"&gt;Chris Pullman&lt;/a&gt;, designed to facilitate a fluent coding style by simplifying the disposal of multiple subscriptions. He demonstrates the method in a console application, comparing it with the traditional CompositeDisposable approach, and highlights its ease of use for handling observable subscriptions.&lt;/p&gt;
&lt;p&gt;Full documentation is available at &lt;a href="https://introtorx.com/"&gt;Introduction to Rx.NET&lt;/a&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=6wQVb8iyLFU&amp;amp;t=0s"&gt;00:00&lt;/a&gt; Introduction to Dispose Operator in Rx.NET&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=6wQVb8iyLFU&amp;amp;t=43s"&gt;00:43&lt;/a&gt; Community Contribution and Background&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=6wQVb8iyLFU&amp;amp;t=86s"&gt;01:26&lt;/a&gt; Purpose and Design of &lt;code&gt;DisposeWith&lt;/code&gt; Method&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=6wQVb8iyLFU&amp;amp;t=107s"&gt;01:47&lt;/a&gt; Demo: &lt;code&gt;DisposeWith&lt;/code&gt; Method in Action&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=6wQVb8iyLFU&amp;amp;t=131s"&gt;02:11&lt;/a&gt; CompositeDisposable vs &lt;code&gt;DisposeWith&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=6wQVb8iyLFU&amp;amp;t=202s"&gt;03:22&lt;/a&gt; Fluent Style Development with &lt;code&gt;DisposeWith&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=6wQVb8iyLFU&amp;amp;t=272s"&gt;04:32&lt;/a&gt; Conclusion and Further Resources&lt;/li&gt;
&lt;/ul&gt;</description>
      <link>https://endjin.com/what-we-think/talks/rxdotnet-v6-1-new-feature-disposewith</link>
      <guid isPermaLink="true">https://endjin.com/what-we-think/talks/rxdotnet-v6-1-new-feature-disposewith</guid>
      <pubDate>Fri, 03 Oct 2025 16:32:00 GMT</pubDate>
      <category>Reactive Extensions</category>
      <category>dotnet</category>
      <category>Rx.NET</category>
      <category>NuGet</category>
      <category>Reactive Programming</category>
      <category>ReactiveX</category>
      <category>C#</category>
      <category>Open Source</category>
      <category>.NET Foundation</category>
      <category>Talk</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/talks/rx-dotnet-dispose-with.png" />
      <dc:creator>Ian Griffiths</dc:creator>
      <content:encoded><![CDATA[<p>In this episode, <a href="https://endjin.com/who-we-are/our-people/ian-griffiths/">Ian Griffiths</a> introduces the new <code>DisposeWith</code> method available in <a href="https://www.nuget.org/packages/System.Reactive/6.1.0">Rx.NET version 6.1</a>, released in October 2025.</p>
<p>He discusses the new functionalities, including a new <code>TakeUntil()</code> overload and the <code>ResetExceptionDispatchState</code> operator, with additional videos to cover these features.</p>
<p>Ian explains that <code>DisposeWith</code> is a community contribution by <a href="https://github.com/ChrisPulman">Chris Pulman</a>, designed to facilitate a fluent coding style by simplifying the disposal of multiple subscriptions. He demonstrates the method in a console application, comparing it with the traditional <code>CompositeDisposable</code> approach, and highlights its ease of use for handling observable subscriptions.</p>
<p>Full documentation is available at <a href="https://introtorx.com/">Introduction to Rx.NET</a>.</p>
<ul>
<li><a href="https://www.youtube.com/watch?v=6wQVb8iyLFU&amp;t=0s">00:00</a> Introduction to Dispose Operator in Rx.NET</li>
<li><a href="https://www.youtube.com/watch?v=6wQVb8iyLFU&amp;t=43s">00:43</a> Community Contribution and Background</li>
<li><a href="https://www.youtube.com/watch?v=6wQVb8iyLFU&amp;t=86s">01:26</a> Purpose and Design of <code>DisposeWith</code> Method</li>
<li><a href="https://www.youtube.com/watch?v=6wQVb8iyLFU&amp;t=107s">01:47</a> Demo: <code>DisposeWith</code> Method in Action</li>
<li><a href="https://www.youtube.com/watch?v=6wQVb8iyLFU&amp;t=131s">02:11</a> CompositeDisposable vs <code>DisposeWith</code></li>
<li><a href="https://www.youtube.com/watch?v=6wQVb8iyLFU&amp;t=202s">03:22</a> Fluent Style Development with <code>DisposeWith</code></li>
<li><a href="https://www.youtube.com/watch?v=6wQVb8iyLFU&amp;t=272s">04:32</a> Conclusion and Further Resources</li>
</ul>
<p><a href="https://www.youtube.com/watch?v=6wQVb8iyLFU"><img src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/talks/rx-dotnet-dispose-with.png"></a></p><p>I'm going to tell you about a new operator called <code>DisposeWith</code> that we've added to Rx.NET in version 6.1.</p>
<p>In October 2025, we released Rx v6.1, and it's a new minor release because it has new functionality. It has a new <code>DisposeWith()</code> method, which is the subject of this video. It has a <code>TakeUntil()</code> overload that takes a <code>CancellationToken</code>, and there's another video about that, and there'll also be another video about the new <code>ResetExceptionDispatchState</code> operator.</p>
<p>Now, if you've been following Rx for a while, you know we're also working on fixing some packaging problems, but that's not going to happen till the next major release: v7.</p>
<p>So what's this <code>DisposeWith</code> method all about?</p>
<p>The first thing I should say is that this is a community contribution. Thanks very much to Chris Pulman for suggesting this, providing the code, and being patient while we got it into place. Now Chris, as you may be aware, works on the ReactiveUI framework. He's a very active contributor to the open source community, a long-term user of Rx with a deep understanding of the framework, and he felt this would be a very useful feature for the way that people tend to work with ReactiveUI. It's a fluent style of API, and people working with ReactiveUI find this way of working very useful. That's why we've decided to incorporate this suggestion into the Rx library.</p>
<p>The basic purpose of this addition is to make it simpler to discard multiple subscriptions with a single action. The way the API has been designed is to support what we call a fluent coding style. The best way to show how this works will be to show a quick demo.</p>
<p>Let's take a look at this new method in action. Right now I've got a very simple console application using Rx that has a couple of observable sources and it subscribes to both of them, lets those subscriptions run, and then when the user presses enter, it tears each of them down individually. This works. You can see they're both running there. When I hit enter, they both shut down, and that's fine. The thing is that I've had to call <code>dispose</code> separately on each one, and some people find that a bit onerous.</p>
<p>For a long time in Rx, we've had this thing called <code>CompositeDisposable</code>, which is essentially a collection of disposables, so we could just do this: add both of those disposables to this composite, and then we only have to call <code>dispose</code> on that composite object and the behavior should be exactly the same. Both my subscriptions are running. I hit enter. Both of them shut down, but this arguably hasn't really saved me anything because I actually got, if anything, slightly more code here.</p>
<p>We could do a little better. We could observe that this is actually a collection, and so maybe I could use the collection syntax to initialize this thing. Now I've said my composite disposable consists of a couple of these things, and now they're both running. I hit enter, they both get disposed.</p>
<p>That's an option, but I have had to rather reshape my code to work around this particular way of initializing it.</p>
<p>The idea with this new <code>DisposeWith</code> extension method is that it fits in a bit better with how you might want to set up your observable subscriptions in the first place. Instead of having to do this, we can just do <code>.DisposeWith(cd)</code>, so we just add this on the tail end of whatever our observable sequence for subscription ends up looking like. Now if I run this once more, I hit enter, my single call to the composite disposable disposes both of them.</p>
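<p>A minimal sketch of that final version of the demo looks something like this (illustrative, not the exact demo code):</p>
<pre><code>using System;
using System.Reactive.Disposables;
using System.Reactive.Disposables.Fluent; // brings the DisposeWith extension into scope
using System.Reactive.Linq;

var cd = new CompositeDisposable();

// Each subscription is attached to the composite as part of the fluent chain.
Observable.Interval(TimeSpan.FromSeconds(1))
    .Subscribe(i =&gt; Console.WriteLine($"Source A: {i}"))
    .DisposeWith(cd);

Observable.Interval(TimeSpan.FromSeconds(1.5))
    .Subscribe(i =&gt; Console.WriteLine($"Source B: {i}"))
    .DisposeWith(cd);

Console.ReadLine();

// A single call tears down both subscriptions.
cd.Dispose();
</code></pre>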
<p>It's a very simple change. It doesn't really add anything you couldn't do before, but it does enable this, what we call fluent style, development.</p>
<p>You'll notice that it brought in an additional namespace here. This means that if you're not a fan of fluent APIs - and not everyone is - this method won't appear unless you choose to bring that namespace in. Because it does become available on all disposables, it's only there if you choose to have it there. But if you like this style of development, then it's there for you.</p>
<p>In conclusion, this is a simple way to use the <code>CompositeDisposable</code> class to handle multiple unsubscriptions in a single action using a fluent style of API. If you want to use this, just download <code>System.Reactive</code> 6.1, add the using directive for the <code>System.Reactive.Disposables.Fluent</code> namespace, and then you can just call <code>DisposeWith</code> to attach each subscription to your existing composite disposable. We've documented this on <a href="https://introtorx.com/">https://introtorx.com</a> if you want more details.</p>
<p>Thanks for listening. My name's Ian Griffiths.</p>]]></content:encoded>
    </item>
    <item>
      <title>Rx.NET v6.1 New Feature: TakeUntil(CancellationToken)</title>
      <description>&lt;p&gt;In this video, &lt;a href="https://endjin.com/who-we-are/our-people/ian-griffiths/"&gt;Ian Griffiths&lt;/a&gt; introduces the new &lt;code&gt;TakeUntil(CancellationToken)&lt;/code&gt; operator in &lt;a href="https://www.nuget.org/packages/System.Reactive/6.1.0"&gt;Rx.NET 6.1&lt;/a&gt;, released in October 2025.&lt;/p&gt;
&lt;p&gt;He discusses the purpose and functionality of this operator, which allows users to stop an infinite source when a cancellation token is signalled.&lt;/p&gt;
&lt;p&gt;Ian acknowledges the contributions of community members &lt;a href="https://github.com/nilsauf"&gt;Nils Aufschläger&lt;/a&gt; and &lt;a href="https://github.com/danielcweber"&gt;Daniel Weber&lt;/a&gt; in shaping and developing this feature. Through a simple example using the Interval operator, Ian demonstrates how this new operator works and explains its benefits.&lt;/p&gt;
&lt;p&gt;Learn how to manage infinite sources effectively with the new &lt;code&gt;TakeUntil(CancellationToken)&lt;/code&gt; operator in Rx.NET 6.1.&lt;/p&gt;
&lt;p&gt;Full documentation is available at &lt;a href="https://introtorx.com/"&gt;Introduction to Rx.NET&lt;/a&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=-ivFY_Hv89Y&amp;amp;t=0s"&gt;00:00&lt;/a&gt; Introduction to Rx.NET 6.1 and New Features&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=-ivFY_Hv89Y&amp;amp;t=47s"&gt;00:47&lt;/a&gt; Community Contributions and Design Evolution&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=-ivFY_Hv89Y&amp;amp;t=87s"&gt;01:27&lt;/a&gt; Understanding the TakeUntil(CancellationToken) Operator&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=-ivFY_Hv89Y&amp;amp;t=188s"&gt;03:08&lt;/a&gt; Practical Example: Using TakeUntil with Interval&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=-ivFY_Hv89Y&amp;amp;t=360s"&gt;06:00&lt;/a&gt; Summary and Documentation&lt;/li&gt;
&lt;/ul&gt;</description>
      <link>https://endjin.com/what-we-think/talks/rxdotnet-v6-1-new-feature-takeuntil-cancellationtoken</link>
      <guid isPermaLink="true">https://endjin.com/what-we-think/talks/rxdotnet-v6-1-new-feature-takeuntil-cancellationtoken</guid>
      <pubDate>Fri, 03 Oct 2025 16:31:00 GMT</pubDate>
      <category>Reactive Extensions</category>
      <category>dotnet</category>
      <category>Rx.NET</category>
      <category>NuGet</category>
      <category>Reactive Programming</category>
      <category>ReactiveX</category>
      <category>C#</category>
      <category>Open Source</category>
      <category>.NET Foundation</category>
      <category>Talk</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/talks/rx-dotnet-cancellation-token.png" />
      <dc:creator>Ian Griffiths</dc:creator>
      <content:encoded><![CDATA[<p>In this video, <a href="https://endjin.com/who-we-are/our-people/ian-griffiths/">Ian Griffiths</a> introduces the new <code>TakeUntil(CancellationToken)</code> operator in <a href="https://www.nuget.org/packages/System.Reactive/6.1.0">Rx.NET 6.1</a>, released in October 2025.</p>
<p>He discusses the purpose and functionality of this operator, which allows users to stop an infinite source when a cancellation token is signalled.</p>
<p>Ian acknowledges the contributions of community members <a href="https://github.com/nilsauf">Nils Aufschläger</a> and <a href="https://github.com/danielcweber">Daniel Weber</a> in shaping and developing this feature. Through a simple example using the Interval operator, Ian demonstrates how this new operator works and explains its benefits.</p>
<p>Learn how to manage infinite sources effectively with the new <code>TakeUntil(CancellationToken)</code> operator in Rx.NET 6.1.</p>
<p>Full documentation is available at <a href="https://introtorx.com/">Introduction to Rx.NET</a>.</p>
<ul>
<li><a href="https://www.youtube.com/watch?v=-ivFY_Hv89Y&amp;t=0s">00:00</a> Introduction to Rx.NET 6.1 and New Features</li>
<li><a href="https://www.youtube.com/watch?v=-ivFY_Hv89Y&amp;t=47s">00:47</a> Community Contributions and Design Evolution</li>
<li><a href="https://www.youtube.com/watch?v=-ivFY_Hv89Y&amp;t=87s">01:27</a> Understanding the TakeUntil(CancellationToken) Operator</li>
<li><a href="https://www.youtube.com/watch?v=-ivFY_Hv89Y&amp;t=188s">03:08</a> Practical Example: Using TakeUntil with Interval</li>
<li><a href="https://www.youtube.com/watch?v=-ivFY_Hv89Y&amp;t=360s">06:00</a> Summary and Documentation</li>
</ul>
<p><a href="https://www.youtube.com/watch?v=-ivFY_Hv89Y"><img src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/talks/rx-dotnet-cancellation-token.png"></a></p><p>I am gonna tell you about a new operator introduced in Rx version 6.1, the <code>TakeUntil(CancellationToken)</code> operator. So in October 2025, we've released version 6.1, Rx.NET. It's a new minor version release because we've got new features in there, but no breaking changes. The new features are <code>DisposeWith</code>, which I'll talk about in a different video, the <code>TakeUntil(CancellationToken)</code> operator, which is the topic for this video, and also one called <code>ResetExceptionDispatchState</code>. Now, if you've been following the recent progress of Rx, you'll be aware that we're also looking to solve some longstanding packaging problems. That has not been done in the v6.1 release; that's coming in v7.0.</p>
<p>Okay, so let's take a look at this new operator <code>TakeUntil(CancellationToken)</code>. So just to start with, I should say thank you to the community people who have helped this come into being. So this was originally suggested as a slightly different form of feature, then changed shape.</p>
<p>So the original proposal came from someone whose name I suspect I will mispronounce, but I'll try anyway: Nils Aufschläger. So thank you for the initial proposal, and he also did the work to implement this, but the design changed during the discussion, and that is partly thanks to Daniel Weber, who chipped into the conversation and suggested a completely different approach from the one that Nils originally wanted to take.</p>
<p>What exactly is this? What's the purpose behind this new operator? Because we've had <code>TakeUntil</code> in various forms for a long time in Rx. Basically it says: take all the elements from this source until some criterion is met. And the new feature is that we are able to stop taking elements—to complete the source—when a cancellation token gets signaled.</p>
<p>And the purpose of this is to enable you to take an infinite source - some source that would never complete on its own - and bring it to completion. Things like the interval timer are like this, but so are the event sources that come from events in .NET. If you adapt a regular .NET event into an Rx source, it never completes by itself.</p>
<p>And that was actually the original motivation for this: to enable those sorts of event sources to be completed. The original design was gonna be a specialized disposable implementation, but when we looked at that, it was kind of awkward. We didn't really have any other things that were both disposables and observables, and it wasn't clear how you would implement something that was both in a way that fit in well with the existing examples of those kinds of types in Rx.NET. So Daniel Weber's suggestion was that we could get the same effect - we could solve the same problem - by introducing this new overload of <code>TakeUntil</code> that takes a <code>CancellationToken</code>. And it's actually more flexible; it can work with anything that produces <code>CancellationToken</code>s. And we already have the <code>CancellationDisposable</code> in Rx.NET, which gives you a thing that you can dispose and which sets a <code>CancellationToken</code> when you do so. That was the way we moved forward.</p>
<p>Okay, so let's see this in action. I've got a very simple Rx example here. This is using the <code>Interval</code> operator that's been built into Rx forever, and this is just gonna raise events every second. So it's gonna produce the number 0, 1, 2, 3, and so on indefinitely. And the thing about interval is that it never stops. So when I run this, it's just gonna keep on producing numbers again and again and again, until either we unsubscribe or the process exits.</p>
<p>So we could just unsubscribe. So the <code>Subscribe</code> method does return an <code>IDisposable</code>, and we could hang onto that and then just call <code>Dispose()</code> on that, and that will certainly shut things down. However, what it won't do is enable the completion notification to flow through. So when you unsubscribe, you never see the <code>OnCompleted</code>—or at least you might not see it. It's possible you'll see it during your call to <code>Subscribe()</code>, but there's no guarantee. So if you definitely want to see that, you really want the source itself to shut down, and that's the idea of this new operator—or this new overload of the <code>TakeUntil</code> operator.</p>
<p>So we're gonna need a <code>CancellationToken</code>. So I'm gonna use a <code>CancellationDisposable</code>, which is a type that has been built into Rx for a very long time, and that is able to provide us with a <code>CancellationToken</code>. So what we can do is say: let's take items from this until that <code>CancellationToken</code> is set. And what I can now do is, if I add a couple more things in here, if I dispose of that <code>CancellationDisposable</code> after the first time someone hits enter, that should shut down this observable. Let me just print out another thing here saying "Disposed," and then we're gonna keep running until we get a second key press. So if I run this now, then as before it starts producing numbers, and it will do so—or the underlying interval source will do that—indefinitely. But if I hit enter, that disposes the <code>CancellationDisposable</code>, and that causes the <code>CancellationToken</code> to enter the canceled state, and then our <code>TakeUntil</code> operator says, "Oh, okay, we're done then."</p>
<p>And we should now be able to see, if I add to my subscribe a second callback that says "Completed"—if I run this one more time. So that's producing numbers. I'm gonna hit enter to cause it to dispose that, and you can see right away the source completes properly. So this subscription has been shut down from the source end.</p>
<p>We haven't unsubscribed; we've caused the source itself to complete, and that is what <code>TakeUntil</code> enables us to do.</p>
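<p>Here's a minimal sketch of the demo I've just described (the variable names are mine; it assumes a console app referencing <code>System.Reactive</code> 6.1):</p>
<pre><code class="language-cs">using System;
using System.Reactive.Disposables;
using System.Reactive.Linq;

// Something we can dispose of that signals a CancellationToken when we do so.
CancellationDisposable stop = new();

IObservable&lt;long&gt; numbers = Observable
    .Interval(TimeSpan.FromSeconds(1)) // infinite source: 0, 1, 2, ...
    .TakeUntil(stop.Token);            // new in Rx 6.1: complete when the token is cancelled

numbers.Subscribe(
    x =&gt; Console.WriteLine(x),
    () =&gt; Console.WriteLine("Completed"));

Console.ReadLine();  // first Enter...
stop.Dispose();      // ...cancels the token, so the source completes
Console.WriteLine("Disposed");
Console.ReadLine();  // second Enter exits
</code></pre>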
<p>So in summary, this new <code>TakeUntil</code> operator overload enables us to cause otherwise infinite sources to complete when we want them to. So if you want to use this, you need to get hold of <code>System.Reactive</code> v6.1, and you need to then get a <code>CancellationToken</code> from somewhere. And you can use Rx's <code>CancellationDisposable</code> as one way of doing that, and then you just pass the token into the new <code>TakeUntil</code> overload.</p>
<p>If you want more details, we've documented this operator on the <a href="https://introtorx.com/">https://introtorx.com</a> site.</p>
<p>And meanwhile, I'm Ian Griffiths. Thanks for listening.</p>]]></content:encoded>
    </item>
    <item>
      <title>Rx.NET v6.1 Now Available</title>
      <description>Rx.NET 6.1 is now available, adding three new features: DisposeWith, a TakeUntil overload that takes a CancellationToken, and ResetExceptionDispatchState.</description>
      <link>https://endjin.com/blog/rx-dotnet-v6-1-released</link>
      <guid isPermaLink="true">https://endjin.com/blog/rx-dotnet-v6-1-released</guid>
      <pubDate>Fri, 03 Oct 2025 16:30:00 GMT</pubDate>
      <category>Rx</category>
      <category>Rx.NET</category>
      <category>Reactive Extensions</category>
      <category>Reactive</category>
      <category>System.Reactive</category>
      <category>C#</category>
      <category>CSharp</category>
      <category>dotnet</category>
      <category>Visual Studio</category>
      <category>Visual Studio Code</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/10/rx-dotnet-v6-1-released.png" />
      <dc:creator>Ian Griffiths</dc:creator>
      <content:encoded><![CDATA[<p>We've just released a new version of the <a href="https://github.com/dotnet/reactive">Reactive Extensions for .NET (Rx.NET)</a>. <a href="https://www.nuget.org/packages/System.Reactive/6.1.0"><code>System.Reactive</code> v6.1.0</a> is now available on NuGet.</p>
<h2 id="whats-new">What's new?</h2>
<p>We have three new features (hence the bump in the minor version number):</p>
<ul>
<li><code>DisposeWith</code> extension method for use with <code>CompositeDisposable</code> (see <a href="https://www.youtube.com/watch?v=6wQVb8iyLFU">video</a>)</li>
<li>New <code>TakeUntil</code> overload taking a <code>CancellationToken</code> (see <a href="https://www.youtube.com/watch?v=-ivFY_Hv89Y">video</a>)</li>
<li>New <code>ResetExceptionDispatchState</code> operator (see <a href="https://www.youtube.com/watch?v=_4BVPQit6EM">video</a>)</li>
</ul>
<h3 id="disposewith-extension-method"><code>DisposeWith</code> extension method</h3>
<p>Thanks to <a href="https://github.com/ChrisPulman">Chris Pulman</a> for implementing this.</p>
<p>This simplifies the use of <code>CompositeDisposable</code> when using a 'fluent' coding style. E.g.:</p>
<pre><code class="language-cs">CompositeDisposable d = new();

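// Each subscription's IDisposable is added to the composite, so both
// subscriptions are torn down together when 'd' is disposed.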
someObservable1.Subscribe(myObserver1)
    .DisposeWith(d);
someObservable2.Subscribe(myObserver2)
    .DisposeWith(d);
</code></pre>
<p>This video provides more detail:</p>
<p></p><div class="responsive-video pull-wide"><iframe src="https://www.youtube.com/embed/6wQVb8iyLFU" frameborder="0" allowfullscreen=""></iframe></div><p></p>
<h3 id="takeuntilcancellationtoken"><code>TakeUntil(CancellationToken)</code></h3>
<p>Thanks to <a href="https://github.com/nilsauf">Nils Aufschläger</a> for the initial suggestion, and for implementing the operator. Thanks to <a href="https://github.com/danielcweber">Daniel Weber</a> for proposing the design we ultimately chose.</p>
<p>The problem we wanted to solve was to be able to take an 'infinite' <code>IObservable&lt;T&gt;</code> (one that never calls <code>OnCompleted</code>) and cause it to complete. Nils wanted to be able to trigger this completion by calling <code>Dispose</code> on some object. Daniel pointed out that if we added an overload to <code>TakeUntil</code> that accepts a <code>CancellationToken</code>, we can use Rx's existing <code>CancellationDisposable</code> to achieve this, while also enabling any other cancellation source to be used as well.</p>
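<p>As a quick illustration (a sketch, rather than code from the release), combining the new overload with <code>CancellationDisposable</code> looks like this:</p>
<pre><code class="language-cs">CancellationDisposable stop = new();

IObservable&lt;long&gt; source = Observable
    .Interval(TimeSpan.FromSeconds(1)) // never completes on its own
    .TakeUntil(stop.Token);            // completes when the token is cancelled

// ...later, disposing 'stop' cancels the token, causing 'source' to complete.
stop.Dispose();
</code></pre>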
<p>This video provides more detail:</p>
<p></p><div class="responsive-video pull-wide"><iframe src="https://www.youtube.com/embed/-ivFY_Hv89Y" frameborder="0" allowfullscreen=""></iframe></div><p></p>
<h3 id="resetexceptiondispatchinfo"><code>ResetExceptionDispatchInfo</code></h3>
<p>Thanks to <a href="https://github.com/adamjones2">Adam Jones</a> for the initial issue report, and for reviewing our work on this.</p>
<p>Adam reported a peculiar behaviour in which Rx was causing the <code>StackTrace</code> reported by an exception to grow longer each time it rethrew that exception. This would happen if the following were both true:</p>
<ul>
<li>something delivers the same exception to <code>OnError</code> more than once (e.g., because of multiple subscriptions to the same <code>Observable.Throw</code>)</li>
<li>you perform multiple <code>await</code>s that cause that same exception to be thrown</li>
</ul>
<p>Here's a contrived example that illustrates the issue.</p>
<pre><code class="language-cs">IObservable&lt;int&gt; ox = Observable.Throw&lt;int&gt;(new Exception("Bang!"));

try
{
    await ox;
}
catch (Exception x)
{
    // Exception will look normal here.
    Console.WriteLine(x);
}

try
{
    await ox;
}
catch (Exception x)
{
    // Exception will have duplicated stack trace here.
    Console.WriteLine(x);
}
</code></pre>
<p>Despite how it looks, this is not actually a bug. .NET itself will do exactly the same thing if you use <code>await Task.FromException(ex)</code> twice on the same exception object. The basic issue here is that there are some rules around when it is acceptable to rethrow an exception.</p>
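<p>Here's a minimal Rx-free sketch of that equivalent behaviour (not from the release notes, but easy to try for yourself in a console app):</p>
<pre><code class="language-cs">Exception ex = new("Bang!");

try { await Task.FromException(ex); }
catch (Exception x) { Console.WriteLine(x); } // stack trace looks normal here

try { await Task.FromException(ex); }
catch (Exception x) { Console.WriteLine(x); } // same exception object, so the stack trace has grown
</code></pre>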
<p>If we were to modify either <code>Throw</code> or Rx's <code>await</code> integration to stop this from happening, it would break some other important scenarios in which currently we flow exception origin information correctly. That means that the code shown above will always exhibit this behaviour and we can't change that without causing new problems.</p>
<p>Instead, in this release of Rx, we've done two things to address this issue:</p>
<ul>
<li>The documentation at <a href="https://introtorx.com/">IntroToRx.com</a> now alerts readers to this problem and describes how to avoid it in the various places where it is a relevant factor</li>
<li>We've added a new <code>ResetExceptionDispatchState</code> operator that enables you to get the behaviour you might have expected.</li>
</ul>
<p>We can avoid this stack trace repetition by using Rx 6.1's new <code>ResetExceptionDispatchState</code> operator. All it takes is a change to the first statement:</p>
<pre><code class="language-cs">IObservable&lt;int&gt; ox = Observable
    .Throw&lt;int&gt;(new Exception("Bang!"))
    .ResetExceptionDispatchState();
</code></pre>
<p>This tells Rx that the exception won't be populated with state such as a correct stack trace or fault bucket information at the point where it emerges from <code>Throw</code>, and so we need Rx to reset that information. (In effect, this operator performs a <code>throw</code> at the instant the exception emerges from the source.)</p>
<p>Existing scenarios that were relying on Rx's behaviour of preserving exception dispatch state info will continue to work because we have not changed the core behaviour. But examples that require that state to be reset on each call to <code>OnError</code> now have a straightforward way to achieve that.</p>
<p>This video provides more detail:</p>
<p></p><div class="responsive-video pull-wide"><iframe src="https://www.youtube.com/embed/_4BVPQit6EM" frameborder="0" allowfullscreen=""></iframe></div><p></p>
<h2 id="whats-next">What's next?</h2>
<p>Now that Rx 6.1 is out, we are turning our attention to the next release. It will need to be a new major version (Rx v7.0) because it will make these breaking changes:</p>
<ul>
<li>We will remove <code>net6.0</code> support</li>
<li>UI-framework-specific functionality will no longer be available through <code>System.Reactive</code>'s compile-time public API (but will remain in runtime assemblies for binary backwards compatibility); this functionality will move into new NuGet packages</li>
</ul>
<p>These are expressed as negatives because major version bumps are always about breaking changes. So what's the upside?</p>
<p>When .NET 10 ships, we will of course be supporting that. But if we don't manage to get Rx v7.0 out of the door shortly after .NET 10 ships, we'll just add .NET 10 tests to the test suite for Rx 6.1. So it's not yet clear whether upgrading to v7.0 will be required for us to offer .NET 10 support. (That said, Rx 6.0 and 6.1 will in fact work just fine on .NET 10.0. It's just a question of whether we officially support that. As far as I know, Rx 4.4 also works on .NET 9 today, but that's not a combination we support.)</p>
<p>The reason for the UI framework packaging changes in Rx 7.0 is that it will solve a problem that happens today if you build .NET apps with AoT or self-contained deployment that target Windows. With Rx 6.0 and 6.1, if your application has a Windows-specific TFM targeting Windows 10.0.19041 or later (e.g. <code>net6.0-windows10.0.19041</code> or <code>net9.0-windows10.0.22621</code>), the <code>System.Reactive</code> library imposes dependencies on WPF and Windows Forms. So even if you're not using either of those frameworks, an AoT deployment or a self-contained deployment will include a copy of both of those frameworks. This makes the deployable code tens of megabytes larger than it needs to be.</p>
<p>Unsurprisingly, this has proven unpopular. <a href="https://avaloniaui.net/">The AvaloniaUI project</a> abandoned Rx.NET completely because of it. So we will be fixing it in Rx 7.0. We have gone to extreme lengths to minimize the impact on existing code, but there will be some breaking changes in some situations, which is why we will be bumping the major version number.</p>
<h2 id="please-try-it-out">Please try it out</h2>
<p>This new 6.1 release of <a href="https://www.nuget.org/packages/System.Reactive"><code>System.Reactive</code></a> is available on NuGet today. If you're using Rx in your application, please try upgrading. If you have any problems, please file issues at <a href="https://github.com/dotnet/reactive/issues">https://github.com/dotnet/reactive/issues</a>. Meanwhile, we hope you enjoy this new version of the Reactive Extensions for .NET.</p>
<h2 id="more-rx-content">More Rx content</h2>
<p>As well as the two series, Carmel Eve's <a href="https://endjin.com/blog/understanding-rx-making-interfaces-subscribing-and-other-subjects-click">Rx Operators Deep Dive</a> and Richard Kerslake's <a href="https://endjin.com/blog/event-stream-manipulation-using-rx-part-1">Event stream manipulation for Rx with semantic logging</a>, you can find further information here:</p>
<ul>
<li><a href="https://introtorx.com/">Intro to Rx.NET 3rd Edition (2025)</a></li>
<li><a href="https://www.youtube.com/watch?v=dio_BKsS9hY&amp;list=PLJt9xcgQpM60Fz20FIXBvj6ku4a7WOLGb">Rx playlist</a> (on the <a href="https://www.youtube.com/endjin">endjin YouTube channel</a>)</li>
<li><a href="https://www.youtube.com/watch?v=dio_BKsS9hY&amp;list=PLJt9xcgQpM62UBIgAkHjAhzITWMGeXbGY">Rx 101 Workshop</a></li>
<li><a href="https://endjin.com/what-we-think/talks/reactive-extensions-for-dotnet">Rx talk</a> for the dotnetsheff user group</li>
<li><a href="https://reaqtive.net/">https://reaqtive.net/</a> — a persistent, reliable, distributed stream processing system based on Rx</li>
</ul>]]></content:encoded>
    </item>
    <item>
      <title>Batch Processing Triggered Pipeline Runs in Azure Synapse</title>
      <description>Bursty event triggers in Azure Synapse can fire the same pipeline many times in quick succession. A batched-trigger orchestrator collapses them into a single run.</description>
      <link>https://endjin.com/blog/batch-triggered-pipeline-runs-azure-synapse</link>
      <guid isPermaLink="true">https://endjin.com/blog/batch-triggered-pipeline-runs-azure-synapse</guid>
      <pubDate>Thu, 02 Oct 2025 06:32:00 GMT</pubDate>
      <category>Data</category>
      <category>Analytics</category>
      <category>Synapse Pipelines</category>
      <category>Azure Data Factory</category>
      <category>Microsoft Fabric</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/09/batch-triggered-pipeline-runs-azure-synapse.png" />
      <dc:creator>James Broome</dc:creator>
      <content:encoded><![CDATA[<h2 id="overcoming-event-driven-limitations-batching-triggered-pipeline-runs-in-azure-synapse">Overcoming Event-Driven Limitations: Batching Triggered Pipeline Runs in Azure Synapse</h2>
<p>Azure Synapse Pipelines and Azure Data Factory Pipelines provide a flexible way to orchestrate data movement and transformation. Event-driven triggers are a useful feature, allowing pipelines to start automatically as soon as data lands in a storage account. This "just-in-time" processing is often ideal for responsive data ingestion and near-real-time analytics - something that's heavily pushed via the various <a href="https://learn.microsoft.com/en-us/azure/synapse-analytics/synapse-link/sql-synapse-link-overview">"Synapse Link for..." integrations</a>.</p>
<p>However, there are scenarios where a purely event-driven approach isn't quite the optimal fit. What if you have a high volume of small, frequent events, and you'd prefer to process them in larger, more efficient batches? Or perhaps you need to ensure that a pipeline, once triggered, doesn't immediately re-trigger if subsequent events arrive within a very short window, leading to an unwanted cascade of runs. This is where the limitations of the out-the-box event-based triggers can become apparent – they don't natively support batching.</p>
<p>Frequent, bursty pipeline triggers can cause numerous problems. When multiple external events (file arrivals, API calls, whatever your trigger source) try to kick off the same pipeline in short succession, you can end up with a queue of pipeline runs all doing the same thing, maybe with slight variations of the data being processed. Depending on the pipeline's complexity and the resources it consumes, this can lead to:</p>
<ul>
<li><strong>Long Delays:</strong> If 5 updates happened within a short period, and your pipeline takes 10 mins to run, you'll be waiting a long time to see the latest outputs</li>
<li><strong>Race Conditions:</strong> If your pipeline isn't designed to be idempotent, concurrent runs can lead to data inconsistencies</li>
<li><strong>Resource Contention:</strong> Multiple runs fighting for the same compute, storage, or API limits</li>
<li><strong>Performance Degradation:</strong> Everything slows down because the system is overloaded</li>
<li><strong>Unnecessary Costs:</strong> Paying for multiple parallel runs when a single, batched run would suffice</li>
</ul>
<p>In this post, I'm going to describe a reusable pattern that solves the problems described above, using out-the-box pipeline activities to achieve "batched triggered" pipeline runs in Azure Synapse Pipelines (which can equally be applied in Azure Data Factory). In a subsequent post, I'll describe how the same pattern can be adapted for Microsoft Fabric Data Pipelines.</p>
<h2 id="the-batched-trigger-orchestrator-pattern">The Batched Trigger Orchestrator Pattern</h2>
<p>The core idea behind this pattern is to introduce an "orchestrator" pipeline that is called from a parent "workload" pipeline. Upon the orchestrator being triggered, it checks for any other pending, queued runs of the same parent workload pipeline. If it finds them, it effectively signals to the parent pipeline to defer its own execution, passing on to the next triggered pipeline run in the queue. By specifying a configurable delay window, the net effect is that we can control how frequently a "batch" of triggered workload pipeline runs that have accumulated are actually executed.</p>
<p>For the approach to work, the parent workload pipeline needs to be set to only run once at a time using the pipeline concurrency settings. This forces the queue of trigger runs to build up, so that the orchestrator can decide which runs to execute and which runs to ignore. The parent workload pipeline receives the response from the orchestrator telling it whether to proceed or exit (quickly), and reacts accordingly.</p>
<p>Before we take a look at the pipeline definition, let's start with some scenarios.</p>
<h2 id="scenario-1-single-triggered-pipeline-no-queued-runs">Scenario 1: Single triggered pipeline (No Queued Runs)</h2>
<p>In the simplest case, where a single event triggers the pipeline, and no other events arrive within the batching window, then the pipeline executes as expected:</p>
<pre class="mermaid">sequenceDiagram
    participant D1 as Data Source
    participant E1 as Storage Event Trigger
    participant P1 as Workload Pipeline
    participant O1 as Batched Trigger Orchestrator Pipeline
    participant SynapseAPI as Synapse API
    participant P1_Work as Pipeline activities

    D1-&gt;&gt;E1: New event occurs (e.g., file arrival)
    E1-&gt;&gt;P1: Workload Pipeline is triggered
    P1-&gt;&gt;O1: Execute orchestration pipeline (20 second delay)
    O1-&gt;&gt;SynapseAPI: Query Synapse API for queued Workload Pipeline runs since trigger time 
    SynapseAPI--&gt;&gt;O1: Returns empty list
    O1-&gt;&gt;O1: Delay for 20 seconds
    Note over O1: No other Workload Pipeline runs initiated during 20 seconds
    O1-&gt;&gt;SynapseAPI: After 20s, query Synapse API for queued Workload Pipeline runs between trigger time and +20 seconds 
    SynapseAPI--&gt;&gt;O1: Returns empty list
    O1--&gt;&gt;P1: Signals result: Continue Execution
    P1-&gt;&gt;P1: Checks result
    P1-&gt;&gt;P1_Work: Execute workload
    P1_Work--&gt;&gt;P1: Work completes
    P1--&gt;&gt;P1: Workload Pipeline completes
</pre>
<h2 id="scenario-2-multiple-triggered-events-effective-batching">Scenario 2: Multiple triggered events (Effective Batching)</h2>
<p>In this scenario, several events arrive in quick succession, demonstrating how the orchestrator allows only one of the workload pipeline runs (the most recent) to proceed, effectively batching the work.</p>
<pre class="mermaid">sequenceDiagram
    participant D1 as Data Source
    participant E1 as Storage Event Trigger
    participant P1 as Workload Pipeline (Run 1)
    participant O1 as Batched Trigger Orchestrator Pipeline (Run 1)
    participant P2 as Workload Pipeline (Run 2)
    participant O2 as Batched Trigger Orchestrator Pipeline (Run 2)
    participant P3 as Workload Pipeline (Run 3)
    participant O3 as Batched Trigger Orchestrator Pipeline (Run 3)
    participant SynapseAPI as Synapse API
    participant P1_Work as Pipeline activities

    D1-&gt;&gt;E1: New event occurs (e.g., file arrival)
    E1-&gt;&gt;P1: Workload Pipeline is triggered
    P1-&gt;&gt;O1: Execute orchestration pipeline (20 second delay)
    O1-&gt;&gt;SynapseAPI: Query Synapse API for queued Workload Pipeline runs
    SynapseAPI--&gt;&gt;O1: Returns empty list
    O1-&gt;&gt;O1: Delay for 20 seconds

    D1-&gt;&gt;E1: New event occurs (+5 seconds) (e.g., file arrival)
    E1-&gt;&gt;P2: Workload Pipeline is queued

    D1-&gt;&gt;E1: New event occurs (+10 seconds) (e.g., file arrival)
    E1-&gt;&gt;P3: Workload Pipeline is queued
   

    O1-&gt;&gt;SynapseAPI: After 20s, query Synapse API for queued Workload Pipeline runs between trigger time and +20 seconds
    SynapseAPI--&gt;&gt;O1: Returns list containing Run 2 and 3
    O1--&gt;&gt;P1: Signals result: Cancel Execution
    P1-&gt;&gt;P1: Checks result
    P1--&gt;&gt;P1: Workload Pipeline exits

    P2-&gt;&gt;O2: Execute orchestration pipeline (20 second delay)
    O2-&gt;&gt;SynapseAPI: Query Synapse API for queued Workload Pipeline runs between trigger time and +20 seconds
    SynapseAPI--&gt;&gt;O2: Returns list containing Run 3
    O2--&gt;&gt;P2: Signals result: Cancel Execution
    P2-&gt;&gt;P2: Checks result
    P2--&gt;&gt;P2: Workload Pipeline exits

    P3-&gt;&gt;O3: Execute orchestration pipeline (20 second delay)
    O3-&gt;&gt;SynapseAPI: Query Synapse API for queued Workload Pipeline runs between trigger time and +20 seconds
    SynapseAPI--&gt;&gt;O3: Returns empty list
    O3--&gt;&gt;P3: Signals result: Continue Execution
    P3-&gt;&gt;P3: Checks result
    P3-&gt;&gt;P1_Work: Execute workload
    P1_Work--&gt;&gt;P3: Work completes
    P3--&gt;&gt;P3: Workload Pipeline completes
</pre>
<h2 id="scenario-3-events-spread-out-multiple-batches">Scenario 3: Events spread out (Multiple Batches)</h2>
<p>This scenario shows what happens if multiple triggered events arrive with sufficient gaps, leading to multiple successful runs of the workload pipeline.</p>
<pre class="mermaid">sequenceDiagram
    participant D1 as Data Source
    participant E1 as Event Trigger
    participant P1 as Workload Pipeline (Run 1)
    participant O1 as Batched Trigger Orchestrator Pipeline (Run 1)
    participant P2 as Workload Pipeline (Run 2)
    participant O2 as Batched Trigger Orchestrator Pipeline (Run 2)
    participant SynapseAPI as Synapse API
    participant P_Work as Pipeline activities

    D1-&gt;&gt;E1: New event occurs (e.g., file arrival)
    E1-&gt;&gt;P1: Workload Pipeline is triggered
    P1-&gt;&gt;O1: Execute orchestration pipeline (20 second delay)

    O1-&gt;&gt;SynapseAPI: Query Synapse API for queued Workload Pipeline runs 
    SynapseAPI--&gt;&gt;O1: Returns empty list
    O1-&gt;&gt;O1: Delay for 20 seconds
    O1-&gt;&gt;SynapseAPI: After 20s, query Synapse API for queued Workload Pipeline runs between trigger time and +20 seconds
    SynapseAPI--&gt;&gt;O1: Returns empty list
    O1--&gt;&gt;P1: Signals result: Continue Execution
    P1-&gt;&gt;P1: Checks result
    P1-&gt;&gt;P_Work: Execute workload
    P_Work--&gt;&gt;P1: Work completes
    P1--&gt;&gt;P1: Workload Pipeline completes

    D1-&gt;&gt;E1: New event occurs (+30 seconds) (e.g., file arrival)
    E1-&gt;&gt;P2: Workload Pipeline is triggered
    P2-&gt;&gt;O2: Execute orchestration pipeline (20 second delay)

    O2-&gt;&gt;SynapseAPI: Query Synapse API for queued Workload Pipeline runs 
    SynapseAPI--&gt;&gt;O2: Returns empty list
    O2-&gt;&gt;O2: Delay for 20 seconds
    O2-&gt;&gt;SynapseAPI: After 20s, query Synapse API for queued Workload Pipeline runs between trigger time and +20 seconds
    SynapseAPI--&gt;&gt;O2: Returns empty list
    O2--&gt;&gt;P2: Signals result: Continue Execution
    P2-&gt;&gt;P2: Checks result
    P2-&gt;&gt;P_Work: Execute workload
    P_Work--&gt;&gt;P2: Work completes
    P2--&gt;&gt;P2: Workload Pipeline completes
</pre>
<h2 id="pipeline-definition-batched-orchestrator">Pipeline Definition: Batched Orchestrator</h2>
<p>Here's the JSON definition for the "Batched Trigger Orchestrator" pipeline that can be used by any workload pipeline that needs to implement this pattern. This pipeline is entirely reusable and generic:</p>
<pre><code class="language-json">{
    "name": "Batched Trigger Orchestrator",
    "properties": {
        "activities": [
            {
                "name": "Initial pipeline queue check",
                "type": "WebActivity",
                "dependsOn": [],
                "policy": {
                    "timeout": "0.12:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [],
                "typeProperties": {
                    "method": "POST",
                    "url": {
                        "value": "https://@{pipeline().DataFactory}.dev.azuresynapse.net//queryPipelineRuns?api-version=2020-12-01",
                        "type": "Expression"
                    },
                    "connectVia": {
                        "referenceName": "AutoResolveIntegrationRuntime",
                        "type": "IntegrationRuntimeReference"
                    },
                    "body": {
                        "value": "{\n  \"lastUpdatedAfter\": \"@{pipeline().parameters.CallingPipelineTriggerTime}\",\n  \"lastUpdatedBefore\": \"@{utcNow()}\",\n  \"filters\": [\n    {\n      \"operand\": \"PipelineName\",\n      \"operator\": \"Equals\",\n      \"values\": [\n        \"@{pipeline()?.TriggeredByPipelineName}\"\n      ]\n    },\n    {\n      \"operand\": \"Status\",\n      \"operator\": \"Equals\",\n      \"values\": [\n        \"Queued\"\n      ]\n    }\n  ]\n}\n",
                        "type": "Expression"
                    },
                    "authentication": {
                        "type": "MSI",
                        "resource": "https://dev.azuresynapse.net/"
                    }
                }
            },
            {
                "name": "If no queued runs then delay",
                "type": "IfCondition",
                "dependsOn": [
                    {
                        "activity": "Update queued run count",
                        "dependencyConditions": [
                            "Succeeded"
                        ]
                    }
                ],
                "userProperties": [],
                "typeProperties": {
                    "expression": {
                        "value": "@greater(length(activity('Initial pipeline queue check').output.value), 0) ",
                        "type": "Expression"
                    },
                    "ifFalseActivities": [
                        {
                            "name": "Delay",
                            "type": "Wait",
                            "dependsOn": [],
                            "userProperties": [],
                            "typeProperties": {
                                "waitTimeInSeconds": {
                                    "value": "@pipeline().parameters.DelayInSeconds",
                                    "type": "Expression"
                                }
                            }
                        },
                        {
                            "name": "Delayed pipeline queue check",
                            "type": "WebActivity",
                            "dependsOn": [
                                {
                                    "activity": "Delay",
                                    "dependencyConditions": [
                                        "Succeeded"
                                    ]
                                }
                            ],
                            "policy": {
                                "timeout": "0.12:00:00",
                                "retry": 0,
                                "retryIntervalInSeconds": 30,
                                "secureOutput": false,
                                "secureInput": false
                            },
                            "userProperties": [],
                            "typeProperties": {
                                "method": "POST",
                                "url": {
                                    "value": "https://@{pipeline().DataFactory}.dev.azuresynapse.net//queryPipelineRuns?api-version=2020-12-01",
                                    "type": "Expression"
                                },
                                "connectVia": {
                                    "referenceName": "AutoResolveIntegrationRuntime",
                                    "type": "IntegrationRuntimeReference"
                                },
                                "body": {
                                    "value": "{\n  \"lastUpdatedAfter\": \"@{pipeline().parameters.CallingPipelineTriggerTime}\",\n  \"lastUpdatedBefore\": \"@{utcNow()}\",\n  \"filters\": [\n    {\n      \"operand\": \"PipelineName\",\n      \"operator\": \"Equals\",\n      \"values\": [\n        \"@{pipeline()?.TriggeredByPipelineName}\"\n      ]\n    },\n    {\n      \"operand\": \"Status\",\n      \"operator\": \"Equals\",\n      \"values\": [\n        \"Queued\"\n      ]\n    }\n  ]\n}\n",
                                    "type": "Expression"
                                },
                                "authentication": {
                                    "type": "MSI",
                                    "resource": "https://dev.azuresynapse.net/"
                                }
                            }
                        },
                        {
                            "name": "Update queued run count again",
                            "type": "SetVariable",
                            "dependsOn": [
                                {
                                    "activity": "Delayed pipeline queue check",
                                    "dependencyConditions": [
                                        "Succeeded"
                                    ]
                                }
                            ],
                            "policy": {
                                "secureOutput": false,
                                "secureInput": false
                            },
                            "userProperties": [],
                            "typeProperties": {
                                "variableName": "QueuedRunCount",
                                "value": {
                                    "value": "@length(activity('Delayed pipeline queue check').output.value) ",
                                    "type": "Expression"
                                }
                            }
                        }
                    ]
                }
            },
            {
                "name": "Update queued run count",
                "type": "SetVariable",
                "dependsOn": [
                    {
                        "activity": "Initial pipeline queue check",
                        "dependencyConditions": [
                            "Succeeded"
                        ]
                    }
                ],
                "policy": {
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [],
                "typeProperties": {
                    "variableName": "QueuedRunCount",
                    "value": {
                        "value": "@length(activity('Initial pipeline queue check').output.value) ",
                        "type": "Expression"
                    }
                }
            },
            {
                "name": "If any queued runs then cancel execution",
                "type": "IfCondition",
                "dependsOn": [
                    {
                        "activity": "If no queued runs then delay",
                        "dependencyConditions": [
                            "Succeeded"
                        ]
                    }
                ],
                "userProperties": [],
                "typeProperties": {
                    "expression": {
                        "value": "@greater(variables('QueuedRunCount'), 0)",
                        "type": "Expression"
                    },
                    "ifFalseActivities": [
                        {
                            "name": "Continue execution",
                            "type": "SetVariable",
                            "dependsOn": [],
                            "policy": {
                                "secureOutput": false,
                                "secureInput": false
                            },
                            "userProperties": [],
                            "typeProperties": {
                                "variableName": "pipelineReturnValue",
                                "value": [
                                    {
                                        "key": "ContinueExecution",
                                        "value": {
                                            "type": "Boolean",
                                            "content": true
                                        }
                                    }
                                ],
                                "setSystemVariable": true
                            }
                        }
                    ],
                    "ifTrueActivities": [
                        {
                            "name": "Cancel execution",
                            "type": "SetVariable",
                            "dependsOn": [],
                            "policy": {
                                "secureOutput": false,
                                "secureInput": false
                            },
                            "userProperties": [],
                            "typeProperties": {
                                "variableName": "pipelineReturnValue",
                                "value": [
                                    {
                                        "key": "ContinueExecution",
                                        "value": {
                                            "type": "Boolean",
                                            "content": false
                                        }
                                    }
                                ],
                                "setSystemVariable": true
                            }
                        }
                    ]
                }
            }
        ],
        "parameters": {
            "DelayInSeconds": {
                "type": "int",
                "defaultValue": 1
            },
            "CallingPipelineTriggerTime": {
                "type": "string"
            }
        },
        "variables": {
            "QueuedRunCount": {
                "type": "Integer",
                "defaultValue": 0
            }
        },
        "annotations": []
    }
}
</code></pre>
<h2 id="pipeline-definition-example-workload-pipeline-using-the-batched-orchestrator-pattern">Pipeline Definition: Example workload pipeline, using the Batched Orchestrator pattern</h2>
<p>Here's the JSON definition of an example pipeline that uses the batched orchestrator. Note the following important points:</p>
<ol>
<li>Concurrency is set to 1</li>
<li>The return value from executing the Batched Orchestrator Pipeline is checked to determine whether to proceed with the main pipeline workload, or to exit processing.</li>
</ol>
<pre><code class="language-json">{
    "name": "Example Batched Trigger Pipeline",
    "properties": {
        "activities": [
            {
                "name": "Orchestrate Batched Triggers",
                "type": "ExecutePipeline",
                "dependsOn": [],
                "policy": {
                    "secureInput": false
                },
                "userProperties": [],
                "typeProperties": {
                    "pipeline": {
                        "referenceName": "Batched Trigger Orchestrator",
                        "type": "PipelineReference"
                    },
                    "waitOnCompletion": true,
                    "parameters": {
                        "DelayInSeconds": 20,
                        "CallingPipelineTriggerTime": {
                            "value": "@pipeline().TriggerTime",
                            "type": "Expression"
                        }
                    }
                }
            },
            {
                "name": "Check if should continue execution",
                "type": "IfCondition",
                "dependsOn": [
                    {
                        "activity": "Orchestrate Batched Triggers",
                        "dependencyConditions": [
                            "Succeeded"
                        ]
                    }
                ],
                "userProperties": [],
                "typeProperties": {
                    "expression": {
                        "value": "@equals(activity('Orchestrate Batched Triggers').output.pipelineReturnValue.ContinueExecution, true)",
                        "type": "Expression"
                    },
                    "ifFalseActivities": [
                        {
                            "name": "Cancel execution",
                            "type": "SetVariable",
                            "dependsOn": [],
                            "policy": {
                                "secureOutput": false,
                                "secureInput": false
                            },
                            "userProperties": [],
                            "typeProperties": {
                                "variableName": "pipelineReturnValue",
                                "value": [
                                    {
                                        "key": "Result",
                                        "value": {
                                            "type": "String",
                                            "content": "Cancelling execution due to batched trigger orchestration"
                                        }
                                    }
                                ],
                                "setSystemVariable": true
                            }
                        }
                    ],
                    "ifTrueActivities": [
                        {
                            "name": "Do work",
                            "type": "Wait",
                            "dependsOn": [],
                            "userProperties": [],
                            "typeProperties": {
                                "waitTimeInSeconds": 30
                            }
                        }
                    ]
                }
            }
        ],
        "concurrency": 1,
        "variables": {
            "QueuedRunCount": {
                "type": "Integer",
                "defaultValue": 0
            }
        },
        "annotations": []
    }
}
</code></pre>
<h2 id="how-it-works-a-step-by-step-breakdown">How it Works: A Step-by-Step Breakdown</h2>
<ol>
<li>Initial Queue Check:</li>
</ol>
<ul>
<li>The "Initial pipeline queue check" is a <code>WebActivity</code> that queries the Synapse API</li>
<li>It looks for any "Queued" runs of the specific calling pipeline (identified by <code>TriggeredByPipelineName</code>) that started after the calling pipeline was triggered, using the request body shown below.</li>
</ul>
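<p>For reference, this is the request body the <code>WebActivity</code> sends to the <code>queryPipelineRuns</code> endpoint (shown un-escaped here; the <code>@{...}</code> expressions are resolved at runtime):</p>
<pre><code class="language-json">{
  "lastUpdatedAfter": "@{pipeline().parameters.CallingPipelineTriggerTime}",
  "lastUpdatedBefore": "@{utcNow()}",
  "filters": [
    {
      "operand": "PipelineName",
      "operator": "Equals",
      "values": [ "@{pipeline()?.TriggeredByPipelineName}" ]
    },
    {
      "operand": "Status",
      "operator": "Equals",
      "values": [ "Queued" ]
    }
  ]
}
</code></pre>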
<ol start="2">
<li>Count Queued Runs:</li>
</ol>
<ul>
<li>The "Update queued run count" <code>SetVariable</code> activity takes the results from the initial API response.</li>
<li>It populates a pipeline variable, <code>QueuedRunCount</code>, with the number of queued instances found.</li>
</ul>
<ol start="3">
<li>Conditional Delay (If No Initial Queue):</li>
</ol>
<ul>
<li>The "If no queued runs, then delay" <code>IfCondition</code> activity evaluates if <code>QueuedRunCount</code> is zero.</li>
<li>If no queued runs are found:
<ul>
<li>It introduces a <code>Delay</code> (a <code>Wait</code> activity) for a configurable <code>DelayInSeconds</code> (defaulting to 1 second). This is how we achieve the micro-batching strategy.</li>
<li>After the delay, a "Delayed pipeline queue check" (another <code>WebActivity</code>) re-queries the Synapse API for queued runs, to see if any other pipeline runs were triggered (and queued) during this delay.</li>
</ul>
</li>
<li>The <code>QueuedRunCount</code> variable is then updated again with this fresh count via "Update queued run count again."</li>
</ul>
<ol start="4">
<li>Continue or Cancel Execution:</li>
</ol>
<ul>
<li>The final "If any queued runs then cancel execution" <code>IfCondition</code> activity makes the decision based on the latest <code>QueuedRunCount</code>.</li>
<li>If <code>QueuedRunCount</code> is greater than zero:
<ul>
<li>The "Cancel execution" <code>SetVariable</code> activity sets the pipeline return variable <code>ContinueExecution</code> to <code>false</code>. This signals to the parent pipeline that it should not proceed, effectively stopping further queuing.</li>
</ul>
</li>
<li>If <code>QueuedRunCount</code> is zero:
<ul>
<li>The "Continue execution" <code>SetVariable</code> activity sets the pipeline return variable <code>ContinueExecution</code> to <code>true</code>. This indicates that the calling pipeline can safely continue its execution, as there are no other instances currently waiting.</li>
</ul>
</li>
</ul>
<h2 id="considerations">Considerations</h2>
<p>The pattern described in this post allows you to decouple event-driven triggering from the actual workload processing. The <code>DelayInSeconds</code> parameter is key to defining your batching window. Experiment with this value to find the right balance between responsiveness and batching efficiency for your specific use case.</p>
<p>Remember, the "Batched Trigger Orchestrator" pipeline <em>orchestrates</em> the running of the parent pipeline, which is where your actual data transformation and workload logic would reside, only executing based on the instruction from the orchestrator.</p>
<p>Full code samples and pipeline template can be found at: <a href="https://github.com/endjin/data-pipeline-patterns">https://github.com/endjin/data-pipeline-patterns</a></p>
<aside class="series-toc" aria-label="Series table of contents">
    <div class="series-toc__header">
        <h3 class="series-toc__title">Data Pipeline Patterns</h3>
        <span class="series-toc__count">3 parts</span>
    </div>
    <ol class="series-toc__list">
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/refresh-semantic-model-fabric-pipelines" class="series-toc__link">
                    <span class="series-toc__part-number">1.</span>
                    <span class="series-toc__part-title">Refreshing Semantic Models in Fabric</span>
                </a>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/refresh-semantic-model-data-factory-synapse-pipelines" class="series-toc__link">
                    <span class="series-toc__part-number">2.</span>
                    <span class="series-toc__part-title">Refreshing Semantic Models in Synapse</span>
                </a>
            </li>
            <li class="series-toc__item series-toc__item--current" aria-current="page">
                <span class="series-toc__part-number">3.</span>
                <span class="series-toc__part-title">Batched Event Triggers in Synapse</span>
                <span class="series-toc__current-label">(you are here)</span>
            </li>
    </ol>
</aside>]]></content:encoded>
    </item>
    <item>
      <title>Reliably refresh Semantic Model from ADF or Synapse Pipelines</title>
      <description>This post describes a pattern for reliably refreshing Power BI semantic models from Azure Data Factory or Azure Synapse Pipelines.</description>
      <link>https://endjin.com/blog/refresh-semantic-model-data-factory-synapse-pipelines</link>
      <guid isPermaLink="true">https://endjin.com/blog/refresh-semantic-model-data-factory-synapse-pipelines</guid>
      <pubDate>Thu, 02 Oct 2025 06:31:00 GMT</pubDate>
      <category>Data</category>
      <category>Analytics</category>
      <category>Synapse Pipelines</category>
      <category>Azure Data Factory</category>
      <category>Power BI</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/09/refresh-semantic-model-data-factory-synapse-pipelines.png" />
      <dc:creator>James Broome</dc:creator>
      <content:encoded><![CDATA[<p>A common requirement we see in Azure Synapse or Azure Data Factory pipelines is the need to refresh a Power BI semantic model as the final step in a data pipeline. This ensures that as soon as new data is processed, it's reflected in the reports and dashboards that drive actions and decisions.</p>
<p>However, there's no out the box mechanism for doing this in either Azure Synapse or Data Factory (unlike in <a href="https://learn.microsoft.com/en-us/fabric/data-factory/semantic-model-refresh-activity">Microsoft Fabric pipelines</a>). And orchestrating this process introduces a surprising amount of complexity: How do we trigger a refresh programmatically? What happens if multiple pipelines try to refresh the same model at once? How do we monitor the refresh and handle failures gracefully?</p>
<p>In this post, I'll outline a robust and reusable pipeline pattern for Azure Data Factory or Azure Synapse Pipelines that addresses these points. It provides a reliable mechanism for refreshing any semantic model by interacting directly with the Power BI REST API. And if you're using Microsoft Fabric Pipelines, then check out my related post that explains <a href="https://endjin.com/blog/refresh-semantic-model-fabric-pipelines">how to achieve the same result in Fabric</a>.</p>
<hr>
<h3 id="triggering-a-model-refresh-with-web-activity">Triggering a model refresh with Web Activity</h3>
<p>Unlike Microsoft Fabric, neither Azure Data Factory nor Azure Synapse Pipelines have a built-in activity for refreshing a Power BI semantic model. However, it's easy enough to initiate a new refresh with a <strong>Web Activity</strong>, making a <code>POST</code> request to the <a href="https://learn.microsoft.com/en-us/rest/api/power-bi/datasets/refresh-dataset">Power BI REST API's <code>/refreshes</code> endpoint</a>.</p>
<p>While this works for triggering a refresh, it's a "fire and forget" approach. Because the API call is <strong>asynchronous</strong> - it initiates the refresh and immediately returns a <code>202 Accepted</code> response - the calling pipeline has no visibility into whether the refresh actually succeeds or fails. This is pretty significant if having updated reports displaying the latest processed data from your pipeline run is important.</p>
<p>On top of that, the Power BI API will throw an error if you attempt to start a refresh while one is already in progress for the same semantic model. In a busy environment with multiple data feeds and trigger schedules, it's highly likely that two pipelines will eventually attempt to refresh the same model concurrently, causing one to fail. While the Web Activity offers some basic retry settings (as do all pipeline activities), this doesn't provide us with a reliable solution. We don't have a clear understanding of how long the refresh will take, nor do we know how many other processes might be trying to trigger a refresh at the same time.</p>
<hr>
<h3 id="a-more-robust-framework-polling-the-power-bi-rest-api">A More Robust Framework: Polling the Power BI REST API</h3>
<p>A good way to think about a more reliable solution is building a process that is aware of the semantic model's state. We need our pipeline to intelligently check the status, wait if necessary, and then monitor for completion.</p>
<p>The pattern described below uses a sequence of activities to manage this process:</p>
<ol>
<li><strong>Check for an Active Refresh</strong>: Before attempting a new refresh, we must first query the API to see if a refresh is already running.</li>
<li><strong>Trigger the Refresh</strong>: If no refresh is active, we can safely initiate a new one.</li>
<li><strong>Poll for Completion</strong>: After triggering, we must wait for the refresh to complete and confirm its final status.</li>
<li><strong>Handle Success or Failure</strong>: Finally, we either complete the pipeline successfully or fail it with a meaningful error if the refresh did not succeed.</li>
</ol>
<p>To implement this, we'll create a parameterised pipeline that uses the <strong>Workspace ID</strong> and <strong>Dataset ID</strong> (Semantic Model ID) as inputs. The core logic is built using <code>Web</code>, <code>Until</code>, and <code>If Condition</code> activities.</p>
<p>First, we need to check if a refresh is already in progress. We can do this by wrapping a <code>Web Activity</code> inside an <code>Until</code> loop. The <code>Web Activity</code> makes a <code>GET</code> request to the <code>/refreshes?$top=1</code> endpoint of the Power BI API to fetch the latest refresh status.</p>
<pre><code class="language-json">{
    "name": "Until No Refresh Running",
    "type": "Until",
    "dependsOn": [
        {
            "activity": "Set Refresh Status to Unknown",
            "dependencyConditions": [
                "Succeeded"
            ]
        }
    ],
    "userProperties": [],
    "typeProperties": {
        "expression": {
            "value": "@not(equals('Unknown', variables('Refresh Status')))",
            "type": "Expression"
        },
        "activities": [
            {
                "name": "Check Initial Refresh Status",
                "type": "WebActivity",
                "dependsOn": [],
                "policy": {
                    "timeout": "0.12:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [],
                "typeProperties": {
                    "method": "GET",
                    "url": {
                        "value": "https://api.powerbi.com/v1.0/myorg/groups/@{pipeline().parameters.WorkspaceId}/datasets/@{pipeline().parameters.DatasetId}/refreshes?$top=1",
                        "type": "Expression"
                    },
                    "connectVia": {
                        "referenceName": "AutoResolveIntegrationRuntime",
                        "type": "IntegrationRuntimeReference"
                    },
                    "body": {
                        "notifyOption": "NoNotification"
                    },
                    "authentication": {
                        "type": "MSI",
                        "resource": "https://analysis.windows.net/powerbi/api"
                    }
                }
            },
            {
                "name": "Wait 10 Seconds",
                "type": "Wait",
                "dependsOn": [
                    {
                        "activity": "Set Refresh Status",
                        "dependencyConditions": [
                            "Succeeded"
                        ]
                    }
                ],
                "userProperties": [],
                "typeProperties": {
                    "waitTimeInSeconds": 1
                }
            },
            {
                "name": "Set Refresh Status",
                "type": "SetVariable",
                "dependsOn": [
                    {
                        "activity": "Check Initial Refresh Status",
                        "dependencyConditions": [
                            "Succeeded"
                        ]
                    }
                ],
                "policy": {
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [],
                "typeProperties": {
                    "variableName": "Refresh Status",
                    "value": {
                        "value": "@activity('Check Initial Refresh Status').output.value[0].status",
                        "type": "Expression"
                    }
                }
            }
        ],
        "timeout": "0.12:00:00"
    }
}
</code></pre>
<p>Once the request is made, we can capture the result from the JSON response body and store it in a pipeline variable called <code>Refresh Status</code>. The <a href="https://learn.microsoft.com/en-us/rest/api/power-bi/datasets/get-refresh-history#refresh">Power BI REST API documentation</a> explains that the status will be <code>"Unknown"</code> if a refresh is actively running, or <code>"Completed"</code> or <code>"Failed"</code> if it has finished. Knowing this, we can wrap the API query in an <code>Until</code> activity loop, polling the API until we get a response that isn't <code>"Unknown"</code>, waiting for 10 seconds between each check. We also need to initialise the <code>Refresh Status</code> variable to <code>"Unknown"</code> before the loop begins.</p>
<div class="aside"><p>N.B. If you're using a Service Principal or System Assigned Managed Identity for your Web Activity connection, you will need to ensure the correct Contributor permission is applied to the target Power BI workspace to allow querying of refresh statuses.</p>
</div>
<p>Once the loop confirms there is no active refresh, we use another <code>Web Activity</code> to <code>POST</code> to the <code>/refreshes</code> endpoint and trigger a new one. This is followed by a second, identical <code>Until</code> loop that polls for the completion of the refresh we just started.</p>
<pre><code class="language-json">{
    "name": "Trigger Refresh",
    "type": "WebActivity",
    "dependsOn": [
        {
            "activity": "Until No Refresh Running",
            "dependencyConditions": [
                "Succeeded"
            ]
        }
    ],
    "policy": {
        "timeout": "0.12:00:00",
        "retry": 0,
        "retryIntervalInSeconds": 30,
        "secureOutput": false,
        "secureInput": false
    },
    "userProperties": [],
    "typeProperties": {
        "method": "POST",
        "url": {
            "value": "https://api.powerbi.com/v1.0/myorg/groups/@{pipeline().parameters.WorkspaceId}/datasets/@{pipeline().parameters.DatasetId}/refreshes",
            "type": "Expression"
        },
        "connectVia": {
            "referenceName": "AutoResolveIntegrationRuntime",
            "type": "IntegrationRuntimeReference"
        },
        "body": {
            "notifyOption": "NoNotification"
        },
        "authentication": {
            "type": "MSI",
            "resource": "https://analysis.windows.net/powerbi/api"
        }
    }
},
{
    "name": "Poll For Completion",
    "type": "Until",
    "dependsOn": [
        {
            "activity": "Trigger Refresh",
            "dependencyConditions": [
                "Succeeded"
            ]
        }
    ],
    "userProperties": [],
    "typeProperties": {
        "expression": {
            "value": "@not(equals('Unknown', variables('Refresh Status')))",
            "type": "Expression"
        },
        "activities": [
            {
                "name": "Check Refresh Status",
                "type": "WebActivity",
                "dependsOn": [],
                "policy": {
                    "timeout": "0.12:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [],
                "typeProperties": {
                    "method": "GET",
                    "url": {
                        "value": "https://api.powerbi.com/v1.0/myorg/groups/@{pipeline().parameters.WorkspaceId}/datasets/@{pipeline().parameters.DatasetId}/refreshes?$top=1",
                        "type": "Expression"
                    },
                    "connectVia": {
                        "referenceName": "AutoResolveIntegrationRuntime",
                        "type": "IntegrationRuntimeReference"
                    },
                    "body": {
                        "notifyOption": "NoNotification"
                    },
                    "authentication": {
                        "type": "MSI",
                        "resource": "https://analysis.windows.net/powerbi/api"
                    }
                }
            },
            {
                "name": "Wait 10s",
                "type": "Wait",
                "dependsOn": [
                    {
                        "activity": "Update Refresh Status",
                        "dependencyConditions": [
                            "Succeeded"
                        ]
                    }
                ],
                "userProperties": [],
                "typeProperties": {
                    "waitTimeInSeconds": 1
                }
            },
            {
                "name": "Update Refresh Status",
                "type": "SetVariable",
                "dependsOn": [
                    {
                        "activity": "Check Refresh Status",
                        "dependencyConditions": [
                            "Succeeded"
                        ]
                    }
                ],
                "policy": {
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [],
                "typeProperties": {
                    "variableName": "Refresh Status",
                    "value": {
                        "value": "@activity('Check Refresh Status').output.value[0].status",
                        "type": "Expression"
                    }
                }
            }
        ],
        "timeout": "0.12:00:00"
    }
}
</code></pre>
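<p>As an aside, the body in the <code>Trigger Refresh</code> activity above only sets <code>notifyOption</code>, which starts a standard refresh. The same <code>/refreshes</code> endpoint also accepts the optional enhanced refresh parameters described in the Power BI REST API documentation, so if you need finer control the request body could be extended along these lines (a sketch only - the table name is illustrative, and including these parameters switches the call to an enhanced refresh):</p>
<pre><code class="language-json">{
    "type": "full",
    "commitMode": "transactional",
    "maxParallelism": 4,
    "retryCount": 1,
    "objects": [
        {
            "table": "Sales"
        }
    ]
}
</code></pre>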
<p>Finally, an <code>If Condition</code> activity checks the final status. If it's <code>"Completed"</code>, the pipeline succeeds. If it's anything else (e.g., <code>"Failed"</code>), we use a <code>Fail</code> activity to stop the pipeline and report the error. With this in place, a pipeline failure indicates a genuine data processing issue within Power BI, rather than a conflict with another pipeline run.</p>
<pre><code class="language-json">{
    "name": "Check Status",
    "type": "IfCondition",
    "dependsOn": [
        {
            "activity": "Poll For Completion",
            "dependencyConditions": [
                "Succeeded"
            ]
        }
    ],
    "userProperties": [],
    "typeProperties": {
        "expression": {
            "value": "@equals(variables('Refresh Status'), 'Completed')",
            "type": "Expression"
        },
        "ifFalseActivities": [
            {
                "name": "Dataset Refresh Failed",
                "type": "Fail",
                "dependsOn": [],
                "userProperties": [],
                "typeProperties": {
                    "message": {
                        "value": "@concat('Power BI Dataset refresh failed with status of: ', variables('Refresh Status'))",
                        "type": "Expression"
                    },
                    "errorCode": "500"
                }
            }
        ]
    }
}
</code></pre>
<hr>
<h3 id="wrapping-up">Wrapping Up</h3>
<p>With this pattern in place, we have a parameterised, reusable pipeline that can be called from any other pipeline using an <code>Execute Pipeline</code> activity. It intelligently waits until any previously running refreshes have completed before triggering a new one, monitors the outcome, and provides a clear success or failure result.</p>
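<p>As a minimal sketch of what that looks like from a calling pipeline (the activity and referenced pipeline names here are illustrative - the two parameters are the same <code>WorkspaceId</code> and <code>DatasetId</code> used throughout the pattern):</p>
<pre><code class="language-json">{
    "name": "Refresh Sales Model",
    "type": "ExecutePipeline",
    "dependsOn": [],
    "userProperties": [],
    "typeProperties": {
        "pipeline": {
            "referenceName": "Refresh Power BI Dataset",
            "type": "PipelineReference"
        },
        "waitOnCompletion": true,
        "parameters": {
            "WorkspaceId": "00000000-0000-0000-0000-000000000000",
            "DatasetId": "00000000-0000-0000-0000-000000000000"
        }
    }
}
</code></pre>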
<p>I've made this pipeline pattern available as a template, which you can import directly into your own environments. Alternatively, you can use the JSON source code to build it up yourself. All of this can be found in the Pipeline Patterns repository, along with other useful reusable samples.</p>
<p><a href="https://github.com/endjin/data-pipeline-patterns">https://github.com/endjin/data-pipeline-patterns</a></p>
<h2 id="a-quick-note-for-microsoft-fabric-users">A Quick Note for Microsoft Fabric users</h2>
<p>If you're looking to achieve the same result in Microsoft Fabric, the approach is slightly simpler as there's an out-of-the-box <a href="https://learn.microsoft.com/en-us/fabric/data-factory/semantic-model-refresh-activity">Semantic Model Refresh activity</a>. However, the issue with concurrent refreshes still applies, so parts of this pattern can be adapted for use in Fabric to achieve the same level of resilience. <a href="https://endjin.com/blog/refresh-semantic-model-fabric-pipelines">I've covered this in a related post</a>!</p>
<aside class="series-toc" aria-label="Series table of contents">
    <div class="series-toc__header">
        <h3 class="series-toc__title">Data Pipeline Patterns</h3>
        <span class="series-toc__count">3 parts</span>
    </div>
    <ol class="series-toc__list">
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/refresh-semantic-model-fabric-pipelines" class="series-toc__link">
                    <span class="series-toc__part-number">1.</span>
                    <span class="series-toc__part-title">Refreshing Semantic Models in Fabric</span>
                </a>
            </li>
            <li class="series-toc__item series-toc__item--current" aria-current="page">
                <span class="series-toc__part-number">2.</span>
                <span class="series-toc__part-title">Refreshing Semantic Models in Synapse</span>
                <span class="series-toc__current-label">(you are here)</span>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/batch-triggered-pipeline-runs-azure-synapse" class="series-toc__link">
                    <span class="series-toc__part-number">3.</span>
                    <span class="series-toc__part-title">Batched Event Triggers in Synapse</span>
                </a>
            </li>
    </ol>
</aside>]]></content:encoded>
    </item>
    <item>
      <title>Reliably refreshing a Semantic Model from Microsoft Fabric Pipelines</title>
      <description>This post describes a pattern for reliably refreshing Power BI semantic models from Microsoft Fabric Pipelines.</description>
      <link>https://endjin.com/blog/refresh-semantic-model-fabric-pipelines</link>
      <guid isPermaLink="true">https://endjin.com/blog/refresh-semantic-model-fabric-pipelines</guid>
      <pubDate>Thu, 02 Oct 2025 06:30:00 GMT</pubDate>
      <category>Data</category>
      <category>Analytics</category>
      <category>Azure Data Factory</category>
      <category>Microsoft Fabric</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/09/refresh-semantic-model-fabric-pipelines.png" />
      <dc:creator>James Broome</dc:creator>
      <content:encoded><![CDATA[<p>A common requirement we see in Microsoft Fabric pipelines is the need to refresh a Power BI semantic model as the final step in a data pipeline. This ensures that as soon as new data is processed, it's reflected in the reports and dashboards that drive actions and decisions.</p>
<p>Of course, there's a built-in activity for just this purpose in Fabric, and it works pretty well for simple scenarios. However, there are definitely some gotchas to be aware of when you try to use it in real-world situations. In this post, I'm going to describe a more robust and reusable solution for reliably refreshing your semantic models, building on the out-of-the-box activity with some of the techniques and patterns I've outlined in a related post when implementing the same functionality in Azure Data Factory and Azure Synapse Pipelines.</p>
<h2 id="fabrics-semantic-model-refresh-activity">Fabric's Semantic Model Refresh activity</h2>
<p>Let's start with the <a href="https://learn.microsoft.com/en-us/fabric/data-factory/semantic-model-refresh-activity">Fabric Data Pipeline's Semantic Model Refresh activity</a>. Once it's added to your pipeline, you just point it to an existing semantic model based on the pre-loaded list of workspaces and models that Fabric knows about. For a single, isolated refresh, it works exactly as you'd expect. Behind the scenes, this activity leverages the Power BI REST API to initiate the refresh. The process of triggering a new refresh action is asynchronous, so the activity will automatically poll the API until the refresh is complete. If the refresh succeeds, then so does the activity and your pipeline will continue, or exit successfully. But if the refresh fails, then the activity will throw an error and fail the pipeline run.</p>
<h2 id="triggering-multiple-refreshes-will-fail">Triggering multiple refreshes will fail</h2>
<p>However, consider a scenario where you have multiple pipeline runs attempting to refresh the same model concurrently. This isn't uncommon, with the variety of options around automatically triggering pipeline runs. If data is being updated frequently and your semantic model takes a while to process, then it's highly likely that you'll hit this scenario.</p>
<p>The problem here is that the Power BI API will throw an error if you attempt to trigger a new refresh while one is already in progress, and the Semantic Model Refresh activity doesn't attempt to handle this gracefully. So, whilst you can have multiple pipeline runs executing simultaneously, when it comes to the model refresh step your pipeline is going to error and report a failure.</p>
<p>While the built-in activity offers some basic retry settings (as do all pipeline activities), this doesn't provide us with a reliable solution. We don't have a clear understanding of how long the refresh will take, nor do we know how many other processes might be trying to trigger a refresh at the same time.</p>
<h2 id="a-more-robust-approach-polling-the-power-bi-rest-api">A more robust approach: polling the Power BI REST API</h2>
<p>What we need is a mechanism to reliably trigger a refresh, ensuring that any previous refreshes have completed first. My <a href="https://endjin.com/blog/refresh-semantic-model-data-factory-synapse-pipelines">previous post about doing this in Azure Synapse and Azure Data Factory</a> includes all the pieces we need.</p>
<p>To achieve this, we need to interact with the <a href="https://learn.microsoft.com/en-us/rest/api/power-bi/datasets/get-refresh-history">Power BI REST API directly to query the status of running refreshes</a>. We can do this easily using a simple Web Activity within our Fabric pipeline, parameterising the Workspace ID and the Semantic Model (or Dataset) ID to construct the URL dynamically. We'll issue a GET request to the API, including a query string parameter to filter the results and return only the latest refresh (<code>$top=1</code>).</p>
<pre><code class="language-json">{
    "name": "Check Refresh Status",
    "type": "WebActivity",
    "dependsOn": [],
    "policy": {
        "timeout": "0.12:00:00",
        "retry": 0,
        "retryIntervalInSeconds": 30,
        "secureOutput": false,
        "secureInput": false
    },
    "typeProperties": {
        "method": "GET",
        "relativeUrl": "/v1.0/myorg/groups/@{pipeline().parameters.WorkspaceId}/datasets/@{pipeline().parameters.DatasetId}/refreshes?$top=1"
    },
    "externalReferences": {
        "connection": ""
    }
}
</code></pre>
<div class="aside"><p>The <a href="https://learn.microsoft.com/en-us/fabric/data-factory/web-activity#web-activity-settings">Connections in Microsoft Fabric</a> work slightly differently to Linked Services in Azure Synapse or Azure Data Factory. The code sample above shows a relative Url to the <code>/refreshes</code> endpoint, and the Web Activity will need a configured <code>Web v2</code> Connection, with:</p>
<ul>
<li>The <code>Base Url</code> property pointing to the Power BI REST API Url: <code>https://api.powerbi.com</code></li>
<li>The Token Audience Uri pointing to: <code>https://analysis.windows.net/powerbi/api</code></li>
</ul>
<p>And whichever authentication method you use, you will need to ensure the correct Contributor permission is applied to the workspace to allow querying of refresh statuses.</p>
</div>
<p>Once the request is made, we can capture the result from the JSON response body and store it in a pipeline variable called <code>Refresh Status</code>. The <a href="https://learn.microsoft.com/en-us/rest/api/power-bi/datasets/get-refresh-history#refresh">Power BI REST API documentation</a> explains that the status will be <code>"Unknown"</code> if a refresh is actively running, or <code>"Completed"</code> or <code>"Failed"</code> if it has finished. Knowing this, we can wrap the API query in an <code>Until</code> activity loop, polling the API until we get a response that isn't <code>"Unknown"</code>, waiting for 10 seconds between each check. We also need to initialise the <code>Refresh Status</code> variable to <code>"Unknown"</code> before the loop begins.</p>
<pre><code class="language-json">{
    "name": "Set Refresh Status to Unknown",
    "type": "SetVariable",
    "dependsOn": [],
    "policy": {
        "secureOutput": false,
        "secureInput": false
    },
    "typeProperties": {
        "variableName": "Refresh Status",
        "value": "Unknown"
    }
},
{
    "name": "Until No Refresh Running",
    "type": "Until",
    "dependsOn": [
        {
            "activity": "Set Refresh Status to Unknown",
            "dependencyConditions": [
                "Succeeded"
            ]
        }
    ],
    "typeProperties": {
        "expression": {
            "value": "@not(equals('Unknown', variables('Refresh Status')))",
            "type": "Expression"
        },
        "activities": [
            {
                "name": "Wait 10 Seconds",
                "type": "Wait",
                "dependsOn": [
                    {
                        "activity": "Set Refresh Status",
                        "dependencyConditions": [
                            "Succeeded"
                        ]
                    }
                ],
                "typeProperties": {
                    "waitTimeInSeconds": 1
                }
            },
            {
                "name": "Set Refresh Status",
                "type": "SetVariable",
                "dependsOn": [
                    {
                        "activity": "Check Refresh Status",
                        "dependencyConditions": [
                            "Succeeded"
                        ]
                    }
                ],
                "policy": {
                    "secureOutput": false,
                    "secureInput": false
                },
                "typeProperties": {
                    "variableName": "Refresh Status",
                    "value": {
                        "value": "@activity('Check Refresh Status').output.value[0].status",
                        "type": "Expression"
                    }
                }
            },
            {
                "name": "Check Refresh Status",
                "type": "WebActivity",
                "dependsOn": [],
                "policy": {
                    "timeout": "0.12:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "typeProperties": {
                    "method": "GET",
                    "relativeUrl": "/v1.0/myorg/groups/@{pipeline().parameters.WorkspaceId}/datasets/@{pipeline().parameters.DatasetId}/refreshes?$top=1"
                },
                "externalReferences": {
                    "connection": ""
                }
            }
        ],
        "timeout": "0.12:00:00"
    }
}
</code></pre>
<p>Once the loop exits - either because there's no actively running refresh, or because it's waited for one to finish - we can continue with calling the original Semantic Model Refresh activity safely. We can also re-use the two pipeline parameters (Workspace ID and Semantic Model ID) that we're already using in our Web Activity to parameterise the settings of the Semantic Model Refresh activity, using dynamic content.</p>
<pre><code class="language-json">{
    "name": "Semantic model refresh",
    "type": "PBISemanticModelRefresh",
    "dependsOn": [
        {
            "activity": "Until No Refresh Running",
            "dependencyConditions": [
                "Succeeded"
            ]
        }
    ],
    "policy": {
        "timeout": "0.12:00:00",
        "retry": 0,
        "retryIntervalInSeconds": 30,
        "secureOutput": false,
        "secureInput": false
    },
    "typeProperties": {
        "method": "post",
        "waitOnCompletion": true,
        "commitMode": "Transactional",
        "operationType": "SemanticModelRefresh",
        "groupId": "@{pipeline().parameters.WorkspaceId}",
        "datasetId": "@{pipeline().parameters.DatasetId}"
    },
    "externalReferences": {
        "connection": ""
    }
}
</code></pre>
<h2 id="wrapping-up">Wrapping up</h2>
<p>With this in place, we now have a parameterised pipeline that intelligently waits until any previously running refreshes have completed before triggering a new one. We already know that the Semantic Model Refresh activity will automatically poll and wait for the refresh to finish before it completes, and throw an error and fail the pipeline if the refresh itself fails. However, if it fails now, we know it's because something genuinely went wrong in the refresh, rather than a conflict with triggering another refresh.</p>
<p>This entire pipeline pattern is now reusable. You can call it from any other pipeline using the Execute Pipeline activity whenever you need to reliably refresh a semantic model, simply by passing in the relevant workspace and semantic model IDs.</p>
<p>I've made this pipeline pattern available as a template, which you can import directly into your own Fabric environments. Alternatively, you can use the JSON source code to build it up yourself. All of this can be found in the Pipeline Patterns repository, along with other useful reusable samples.</p>
<p><a href="https://github.com/endjin/data-pipeline-patterns">https://github.com/endjin/data-pipeline-patterns</a></p>
<h2 id="a-quick-note-for-azure-data-factory-and-synapse-pipelines-users">A Quick Note for Azure Data Factory and Synapse Pipelines Users</h2>
<p>If you're looking to achieve the same result in Azure Data Factory or Synapse Pipelines, the approach is slightly different, primarily because there isn't an out-of-the-box Semantic Model Refresh activity. However, with a bit of extra logic (which essentially involves making the API call to trigger the refresh, in addition to polling for its status), you can achieve precisely the same result. <a href="https://endjin.com/blog/refresh-semantic-model-data-factory-synapse-pipelines">I've covered this in detail in a related post</a>!</p>
<aside class="series-toc" aria-label="Series table of contents">
    <div class="series-toc__header">
        <h3 class="series-toc__title">Data Pipeline Patterns</h3>
        <span class="series-toc__count">3 parts</span>
    </div>
    <ol class="series-toc__list">
            <li class="series-toc__item series-toc__item--current" aria-current="page">
                <span class="series-toc__part-number">1.</span>
                <span class="series-toc__part-title">Refreshing Semantic Models in Fabric</span>
                <span class="series-toc__current-label">(you are here)</span>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/refresh-semantic-model-data-factory-synapse-pipelines" class="series-toc__link">
                    <span class="series-toc__part-number">2.</span>
                    <span class="series-toc__part-title">Refreshing Semantic Models in Synapse</span>
                </a>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/batch-triggered-pipeline-runs-azure-synapse" class="series-toc__link">
                    <span class="series-toc__part-number">3.</span>
                    <span class="series-toc__part-title">Batched Event Triggers in Synapse</span>
                </a>
            </li>
    </ol>
</aside>]]></content:encoded>
    </item>
    <item>
      <title>FabCon Vienna 2025: Day 1</title>
      <description>FabCon is a conference dedicated to everything Microsoft Fabric. Day 1 was mostly focused around the hundreds of new feature announcements.</description>
      <link>https://endjin.com/blog/fabcon-vienna-2025-day-1</link>
      <guid isPermaLink="true">https://endjin.com/blog/fabcon-vienna-2025-day-1</guid>
      <pubDate>Tue, 30 Sep 2025 06:15:00 GMT</pubDate>
      <category>Microsoft Fabric</category>
      <category>FabCon</category>
      <category>Data</category>
      <category>MCP</category>
      <category>Data Factory</category>
      <category>OneLake</category>
      <category>Copilot</category>
      <enclosure length="0" type="image/png" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/09/fabcon-vienna-2025-day-1.png" />
      <dc:creator>Carmel Eve</dc:creator>
      <content:encoded><![CDATA[<p><a href="https://endjin.com/who-we-are/our-people/jessica-hill/">Jess</a> and I recently attended Microsoft FabCon (Vienna) - a twice yearly conference dedicated to Microsoft Fabric. Many topics were explored as part of the conference; unsurprisingly a lot of the content was focused around data and data engineering, but there were also many sessions on monitoring, security, governance, AI, and much more. This highlights the importance and positioning of Fabric as an all-encompassing tool for data engineering and exploration.</p>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/09/me-and-jess-at-fabcon.jpg" alt="Jess and me, standing in front of the FabCon Vienna sign" srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2025/09/me-and-jess-at-fabcon.jpg 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2025/09/me-and-jess-at-fabcon.jpg 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2025/09/me-and-jess-at-fabcon.jpg 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2025/09/me-and-jess-at-fabcon.jpg 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<p>This post will run through day 1 of the conference, which was absolutely full to the brim with announcements.</p>
<h2 id="key-note-microsoft-fabric-the-data-platform-for-the-ai-frontier">Key Note - Microsoft Fabric: The Data Platform for the AI Frontier</h2>
<p>Sadly the Vienna transport system failed us on the first full morning of the conference, meaning that we (and many others) had to walk the ~40 minutes to the conference centre - we'll have to wait and see whether the pilgrimage of Fabric enthusiasts over the Danube makes it into the history books... But this meant that we missed the first 20 minutes of the Keynote, and when we arrived the announcements were already in full flow.</p>
<h3 id="developer-productivity">Developer Productivity</h3>
<p>The first section that we saw was announcements about developer productivity, including:</p>
<ul>
<li><strong>MCP Servers for Fabric</strong></li>
<li><strong>All of the Fabric items are now supported in Git and CI/CD</strong> - which is a huge step forward for productionising the platform.</li>
<li><strong>Tabulated view</strong>: You can now view multiple items at once, including multiple lakehouses and workspaces - which will make a big difference in productivity. I've definitely been annoyed at having to go and re-open things when moving around in Fabric so far! The tabs for different workspaces will show up in different colours, making it obvious which you're looking at.</li>
<li>You can now use <strong>variable libraries in shortcuts</strong></li>
<li>The <strong>Fabric 2.0 runtime</strong> (including Spark 4.0, and Delta 4.0).</li>
<li><strong>Materialised lake views</strong> now with incremental refresh, general performance improvements, and native execution engine support</li>
<li><strong>AI functions in Data Wrangler</strong></li>
</ul>
<h3 id="data-warehouse">Data Warehouse</h3>
<ul>
<li>General performance, scale and migration improvements</li>
<li><strong>Warehouse migration assistant</strong> (upload a dacpac)</li>
<li><strong>Graph capabilities</strong> now in Fabric - you can use a no-code view to build up the graph and perform graph-based analysis. It's a shame that at present this feature only exists for data warehouses, and isn't available as a view over a lakehouse.</li>
<li>New <strong>Event House endpoint</strong></li>
<li><strong>Real time automated anomaly detection</strong> using ML</li>
</ul>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/09/real-time-anomaly.jpg" alt="Slide about real-time anomaly detection" srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2025/09/real-time-anomaly.jpg 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2025/09/real-time-anomaly.jpg 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2025/09/real-time-anomaly.jpg 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2025/09/real-time-anomaly.jpg 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<ul>
<li><strong>User defined functions</strong> for calling stored procedures and AI functions</li>
<li>And, you can now <strong>call UDFs directly from Power BI</strong>, passing in measure values. This means that you can have parameterised write-back into your data, and perform analysis based on those values, which could then in turn be used to update the Power BI report. A powerful tool for interactive reporting!</li>
</ul>
<h3 id="data-factory">Data Factory</h3>
<p>There were two announcements around performance:</p>
<ul>
<li><strong>Faster data previews</strong> in Data Factory</li>
<li>And, <strong>Modern Query Evaluator</strong> provides huge performance improvements for Data Flows</li>
</ul>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/09/data-flows-performance.jpg" alt="Data flows performance comparison" srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2025/09/data-flows-performance.jpg 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2025/09/data-flows-performance.jpg 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2025/09/data-flows-performance.jpg 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2025/09/data-flows-performance.jpg 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<p>Alongside this, the prices for Data Flow Gen 2 (CI/CD) have been decreased, using a two-tiered approach:</p>
<ul>
<li>First 10 Minutes: 12 CU (25% reduction)</li>
<li>After that: Drops to 1.5 CU (90% reduction)</li>
</ul>
<p>This, combined with the performance improvement (decreasing time spent), should make a big difference in overall costs.</p>
<h3 id="onelake">OneLake:</h3>
<ul>
<li><strong>New Shortcut endpoints</strong>: Azure Blob and Azure SQL MI,</li>
<li><strong>New Mirroring endpoints</strong>: Oracle (preview), Google Big Query (preview)</li>
<li><strong>OneLake table API</strong>, introducing Iceberg and Delta Lake standard APIs, for greater interoperability, with anything that can understand the new table endpoints.</li>
<li><strong>OneLake storage diagnostics</strong> has been announced, which allows for diagnostics and audit over all data in OneLake, including shortcuts and mirrored databases.</li>
<li><strong>New security centre</strong> (tab) in Microsoft OneLake for increased security management.</li>
</ul>
<p>There was a lot more on this in the OneLake CoreNote later in the day!</p>
<h3 id="power-bi">Power BI:</h3>
<ul>
<li><strong>Power BI modelling in the web is now supported</strong>. This was a theme amongst the Power BI announcements - with the aim of reaching feature parity between the web and desktop experience - opening the door for Mac users!</li>
<li><strong>Button slicer in GA</strong></li>
<li><strong>Card visuals in GA</strong></li>
<li><strong>Copilot in Power BI apps (public preview)</strong>. This is a great feature that allows you to use Copilot within your apps. You get returned visuals on-the-fly to answer the questions you ask. And, you can also use this functionality to discover reports within the application that might be of interest given the context.</li>
</ul>
<p>Again, there was more on this in the Power BI CoreNote.</p>
<h3 id="maps-in-fabric">Maps in Fabric</h3>
<p>This is a new Geospatial feature in Fabric used for displaying data on a map. This is separate to the new map visuals in Power BI <del>(though looks to be powered by the same ArcGis technology)</del> (EDIT: It is not powered by the same ArcGis technology!). You can display realtime data, and build up performant map layers. The maps are also very customisable, letting you display a huge breadth of data at once.</p>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/09/maps-in-Fabric.jpg" alt="An image of a heatmap in Fabric, alongside the key features." srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2025/09/maps-in-Fabric.jpg 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2025/09/maps-in-Fabric.jpg 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2025/09/maps-in-Fabric.jpg 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2025/09/maps-in-Fabric.jpg 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<h2 id="corenote-fabric-data-factory-whats-new-and-roadmap">CoreNote - Fabric Data Factory: What's New and Roadmap</h2>
<p>The next session I attended was the Fabric Data Factory CoreNote, where we explored some of the announcements in more detail.</p>
<p>Again, this was mostly running through a list of announcements.</p>
<p>Here is an overview of the roadmap:</p>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/09/data-factoy-roadmap.jpg" alt="Data Factory roadmap slide" srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2025/09/data-factoy-roadmap.jpg 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2025/09/data-factoy-roadmap.jpg 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2025/09/data-factoy-roadmap.jpg 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2025/09/data-factoy-roadmap.jpg 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<p>There were a few announcements around copy jobs, including:</p>
<ul>
<li><strong>New connectors</strong>:
<img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/09/copy-job.jpg" alt="Slide showing new connectors and roadmap in Copy Jobs." srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2025/09/copy-job.jpg 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2025/09/copy-job.jpg 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2025/09/copy-job.jpg 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2025/09/copy-job.jpg 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></li>
<li>The new "<strong>Copy Job Activity</strong>" in pipelines</li>
<li>Support for <strong>Change Data Capture in Copy Jobs</strong> (in preview) - meaning that you only replicate changed data from a source.</li>
<li>And, you now have the option to "<strong>Reset Incremental Copy</strong>" - which allows you to perform a full re-ingestion on the next run.</li>
</ul>
<p>Outside of this,</p>
<ul>
<li>The session re-touched on the Data Flow improvements</li>
<li>Went into the new <strong>built-in debugging and testing of expressions</strong> in pipelines - incredibly useful as typos in pipeline expressions can be very hard to spot!</li>
<li>You can also now use <strong>Copilot in Data Factory</strong> to create transformations in Power Query and write pipeline expressions. This will massively improve productivity in these environments.</li>
<li>Support for <strong>multiple schedules</strong>, granting much more flexibility in how you trigger your pipelines.</li>
<li>And, the addition of <strong>interval (tumbling window) schedules</strong></li>
</ul>
<p>There was also a section around OneLake, but given I attended the OneLake session later in the day I won't go into detail on that here.</p>
<p>Then they talked about the preview of <strong>Business Process Solutions</strong>, which includes prebuilt models, reports, dashboards, and AI agents for standard business processes, including integrated security and compliance. Currently the solutions focus on:</p>
<ul>
<li>Finance</li>
<li>Sales</li>
<li>Manufacturing</li>
<li>And, procurement</li>
</ul>
<p>I'm interested to see how flexible these models actually are in catering for real-world process analytics.</p>
<p>As mentioned in the keynote, everything is now supported in Git. There is also additional support for Workspace Identity and networking.</p>
<p>Here is a summary slide of the announcements:</p>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/09/data-factory-announcements.jpg" alt="Data Factory announcements by status" srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2025/09/data-factory-announcements.jpg 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2025/09/data-factory-announcements.jpg 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2025/09/data-factory-announcements.jpg 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2025/09/data-factory-announcements.jpg 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<h2 id="corenote-power-bi-roadmap-strategy-vision-and-demos">CoreNote - Power BI Roadmap: Strategy, Vision and Demos</h2>
<p>As I'm sure you've guessed by now, the CoreNotes throughout the day followed a similar line - exploring the wealth of new features and functionality, so this time, in Power BI...</p>
<h3 id="data-visualisations">Data Visualisations</h3>
<ul>
<li>As mentioned earlier, card visuals are now GA. You can add image heroes and customise them in a lot of different ways, which adds a lot of visually interesting options in reports.</li>
<li>Button slicers are also generally available. This includes partial highlighting based on data selection, and customisation.</li>
<li>There is <strong>additional image support</strong> where you can change the image on hover over or click, which can add an interactive feel to reports.</li>
<li>There are new <strong>Azure Maps features</strong>, coming with the replacement of Bing Maps, which allow you to add path layers and customise the base map.</li>
<li>And, you can now create <strong>organisational themes</strong>, which will have a huge impact for people who need to create many different reports within an organisation whilst preserving a coherent look and feel. Certainly for us, we have long-term clients for whom we have created hundreds of reports. This will reduce the manual effort involved in creating new reports, and ensure that theming is completely consistent.</li>
<li><strong>Performance analyser is also now available in the web</strong>, continuing the theme of enabling web feature parity.</li>
</ul>
<h3 id="semantic-modelling">Semantic Modelling</h3>
<ul>
<li>As mentioned earlier, semantic modelling is now available on the web.</li>
<li>Report creation is also now available in the web</li>
<li><strong>Best Practice Analyzer</strong> and <strong>Memory Analyzer</strong> are now generally available. These allow you to use a one-click Fabric notebook to identify improvements to your models.</li>
<li>Modelling over Direct Lake and import models is now available in Power BI Desktop and on the web (preview)</li>
<li><strong>Tabular Model Definition Language (TMDL) is now generally available</strong>. This is a code-first way to define your semantic models. Not only does this cater for those of us who would rather use a code-first than visual approach, it is also useful for large-scale model editing.</li>
<li>You can now choose whether you want to <strong>refresh data, schema or both</strong> in Power BI Desktop</li>
<li>And, there have been some exciting DAX improvements:
<ul>
<li>You can now create <strong>User Defined Functions</strong> (preview). This allows you to define custom functions with parameters. As someone from a software development background, seeing DAX move this way is very exciting, paving the way for useful DAX utility libraries and reusable functionality.</li>
<li><strong>Customised time intelligence</strong> - you can now define custom calendars, which enables huge flexibility in time-based analysis (you can, for example, define your calendar to coincide with the Fiscal year). You can also now perform week-based analysis, and work with "sparse date" columns, where not every date is represented.</li>
<li><strong>Copilot in DAX query view is also now GA</strong> - allowing for improved productivity whilst working on these queries.</li>
</ul>
</li>
</ul>
<h3 id="power-bi-copilot">Power BI Copilot</h3>
<p>And then finally, around Copilot more generally:</p>
<p>There was a good demo about using the Copilot chat in a Power BI App. When you use it inside of an application, it is scoped to the data and the reports inside of that app. You can use it to gain visibility on the available data and reports, or ask specific questions, to which the answers may include on-demand visualisations, which can be edited and interacted with inside of the chat.</p>
<p>You are also able to "prep your data for AI", which means adding metadata that allows Copilot (and Fabric Data Agents) to better understand your data. There are a few things you can do:</p>
<ul>
<li>You can remove certain things from Copilot's view</li>
<li>You can add descriptions to the sources in the semantic model, which Copilot will use to understand the context around the data.</li>
<li>You can set up "verified answers" based on certain visuals, so that when users ask specific questions it knows where to get the answer.</li>
<li>And, you can mark data as prepared for AI, so agents and users know that the model has been prepped for these use cases.</li>
</ul>
<p>You can also now do this with Direct Lake.</p>
<p>And, there is now a Power BI Data Agent that you can connect to from M365 Copilot. This means that you can have these data exploration and report discovery conversations (with all the in-place visuals) directly from M365 Copilot.</p>
<h2 id="corenote-unify-your-data-estate-with-onelake-the-onedrive-for-data">CoreNote - Unify your Data Estate with OneLake - the OneDrive for Data</h2>
<p>The next CoreNote was on <a href="https://endjin.com/blog/what-is-onelake">OneLake - the storage technology which powers Fabric</a>.</p>
<p>There was quite a deep dive into OneLake Catalog, which is a powerful tool for discoverability, tagging, governance, etc.</p>
<p>A question I had was how OneLake Catalog related to Purview. Someone asked this exact question and got an answer along the lines of: "Purview is an enterprise catalog solution, OneLake Catalog is designed to catalog all of your data estate. There are integrations between the two, meaning that you can encompass your OneLake Catalog data within your enterprise solution". To understand a bit more about how this works in practice, keep your eyes open for a coming deep dive on the topic!</p>
<p>There was then an exploration of shortcut transforms. <a href="https://endjin.com/what-we-think/talks/microsoft-fabric-creating-a-onelake-shortcut-to-adls-gen2">Shortcuts allow you to create a pointer to data stored in another system</a> (within Azure, or elsewhere - e.g. in other clouds such as S3) without replicating the data. Shortcut transforms allow you to automatically apply transforms as part of this process, for example, transforming from CSV to Delta. This specific transform has been available for a little while, but you can also now <strong>transform between Parquet and Delta, and JSON and Delta</strong>, as part of the shortcut. Alongside this, there are also new "<strong>AI Transforms</strong>" which allow you to do things like sentiment analysis without setting up an entire ETL process.</p>
<p>They then focused on how you can use mirroring to bring catalogs together. Mirroring syncs the metadata from whatever service you're using (be that Databricks, Snowflake, etc.), and then, if the data is in an open table format, uses shortcuts to mirror the data itself. For proprietary data, it creates a replica and uses change management to keep it up to date. More on this in day 2, where we attended a deeper dive into shortcuts and mirroring in OneLake!</p>
<p>You can also now use <strong>Fabric Agents over shortcuts <em>and</em> mirrored data</strong> - meaning that you can include this data in your natural language queries, and connect to it from Azure Foundry.</p>
<p>As mentioned earlier, <strong>shortcuts to Azure Blob Storage and Azure SQL MI</strong> are now GA. And, <strong>mirroring support for Oracle and Google Big Query</strong>, and some SAP mirroring, is now in preview.</p>
<p>There is a <strong>new OneLake table API</strong>, which is interoperable with Snowflake, and can be used with DuckDB, Spark, and anything that can understand the table endpoints.</p>
<p>As mentioned in the KeyNote, there is a new preview of <strong>OneLake Diagnostics</strong>, which allows you to monitor usage for audit and debugging, including for shortcuts and mirrored data.</p>
<p>There is also a new "<strong>OneLake Security Tab</strong>", which lets you view and define security within OneLake, including mirrored items. This provides a centralised place for managing users and roles across your data estate.</p>
<p>A lot of these OneLake announcements are hot off the press, so I'm sure we'll see more information about what's on offer and examples of usage in the coming weeks.</p>
<h2 id="corenote-whats-next-for-fabric-data-engineering-roadmap-and-innovations">CoreNote - What's next for Fabric Data Engineering: Roadmap and Innovations</h2>
<p>The final session of day 1 focused on data engineering.</p>
<h3 id="ingest-connect">Ingest + Connect</h3>
<ul>
<li>We re-touched on shortcut transformations</li>
<li>You can now <strong>add data connections in notebooks</strong>. This involves adding a connection via the UI within the notebook, which then allows you to autogenerate cells containing the connection code. As someone who has to go and look up the syntax for these connections every time, this feels like a great quality of life improvement!</li>
<li>The <strong>OneLake Catalog integration in Spark</strong> lets you explore available items from the UI.</li>
<li>And, they demonstrated using the <strong>Spark Connector for SQL databases</strong> (announced in March) to read/write from SQL databases directly from Spark.</li>
</ul>
<h3 id="configure-scale">Configure + Scale</h3>
<ul>
<li>For setting up new Spark tools, they are introducing the concept of "<strong>Spark Resource Profiles</strong>". Using these, instead of having many settings to understand and fine-tune, you'll just answer a series of questions about your workload and the configuration will be done for you.</li>
<li>Coming soon: <strong>Custom live pools</strong> in Fabric! Until this point, anyone who wanted to use a custom pool rather than the starter pool had to pay a start-up cost. With this feature, you'll be able to mark custom pools as "live" and configure schedules and deactivation times.</li>
<li><strong>Fabric Runtime 2.0</strong> is being released, which includes Spark and Delta 4.0. Alongside this, they are releasing a feature called "<strong>early runtime access</strong>" where you can try runtimes before they are put through to full release. The idea with this is that you can set up a test environment and migrate that over early. If anything breaks, you can then report this to the Fabric team and they can take this into account before releasing.</li>
<li>There have also been some big performance improvements over the past few months for the <strong>Native Execution Engine</strong>. The headline figure is that, using this engine, you can achieve up to 6X faster performance than using OSS versions of the Spark runtime. You currently need to enable the use of the native execution engine (which I'd definitely recommend doing if you haven't already!), but eventually it will be turned on by default. There are still cases where processing will "fall-back" to normal, but the scenarios the engine can cope with are increasing all the time - for example, CSV support is coming soon!</li>
<li>There also should be big <strong>performance improvements for installations</strong> when publishing environments coming in October, and faster session start up times when you've got <strong>custom libraries</strong> installed!</li>
<li>And, finally, you can now <strong>share sessions across up to 50 notebooks</strong>, rather than the current 5. This is a big win for avoiding those start up times when doing concurrent work in Fabric environments. However, it is worth noting that if you are working in many notebooks concurrently, you need to watch out for "out of memory" exceptions, as the nodes' memory will be shared!</li>
</ul>
<h3 id="transform-model">Transform + Model</h3>
<p>Alongside the "quality of life" improvements in configuration, there have been quite a few improvements specifically for notebooks:</p>
<ul>
<li>You can now see notebook version history across multiple IDEs, allowing for much more flexibility.</li>
<li>User defined functions can be accessed directly from notebooks; they are now discoverable via IntelliSense, and you can now pass dataframes.</li>
<li>Improved error messaging in Fabric notebooks.</li>
<li>Insights into why Spark pools might be taking longer to set up, e.g. when Fabric starter pools are all in use - rare, but possible.</li>
</ul>
<p>And, there is a new Spark Monitoring library, and improvements in the monitoring tabs, which allow you to see what's going on in notebooks whilst they are running. You can use the monitoring library from a notebook to get insight into job statistics like run durations, number of queued/rejected/running jobs, and how many cores are in use.</p>
<p>And soon, you'll be able to see resource allocation across a workspace, and across the whole Fabric capacity.</p>
<p>Alongside this,</p>
<ul>
<li>The Notebook utils APIs are now generally available.</li>
<li>They have added much better GeoSpatial support, using the ArcGIS technology, adding 180+ geospatial analytics functions to Fabric Spark.</li>
<li>There are new AI functions available in Data Wrangler.</li>
<li>Python notebooks are now generally available.</li>
<li>Python 3.12 is coming soon.</li>
<li>And, you will soon be able to connect to the real-time hub directly from the Notebook object explorer.</li>
</ul>
<p>A lot of this is still "coming soon", but they assured us that the majority should be done this quarter. You can see that there's a huge wealth of exciting new functionality and features coming, both in the data engineering space and in Fabric as a whole!</p>
<h2 id="overall">Overall</h2>
<p>That's a wrap on FabCon Day 1 - look out for the next blog, which provides a summary of day 2, where both Jess and I attended more technical "deep dive" sessions. Hope you've enjoyed this firehose of announcements - I certainly feel excited by the direction that Fabric is going, and am looking forward to exploring some of these new features myself over the coming months!</p>
<p><img loading="lazy" src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/blog/2025/09/me-at-fabcon.jpg" alt="Me holding my conference pass." srcset="https://res.cloudinary.com/endjin/image/upload/f_auto/q_75/c_scale/w_480/assets/images/blog/2025/09/me-at-fabcon.jpg 480w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_65/c_scale/w_800/assets/images/blog/2025/09/me-at-fabcon.jpg 800w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_55/c_scale/w_1200/assets/images/blog/2025/09/me-at-fabcon.jpg 1200w, https://res.cloudinary.com/endjin/image/upload/f_auto/q_45/c_scale/w_1600/assets/images/blog/2025/09/me-at-fabcon.jpg 1600w" sizes="(min-width: 70rem) 62rem, 100vw"></p>
<div class="aside"><p>If you're interested in Microsoft Fabric, why not sign up to our new FREE <a href="https://fabricweekly.info/">Fabric Weekly Newsletter</a>? We also run <a href="https://azureweekly.info/">Azure Weekly</a> and <a href="https://powerbiweekly.info/">Power BI Weekly</a> Newsletters too!</p>
</div>
<aside class="series-toc" aria-label="Series table of contents">
    <div class="series-toc__header">
        <h3 class="series-toc__title">FabCon Vienna 2025</h3>
        <span class="series-toc__count">3 parts</span>
    </div>
    <ol class="series-toc__list">
            <li class="series-toc__item series-toc__item--current" aria-current="page">
                <span class="series-toc__part-number">1.</span>
                <span class="series-toc__part-title">Day 1</span>
                <span class="series-toc__current-label">(you are here)</span>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/fabcon-vienna-2025-day-2" class="series-toc__link">
                    <span class="series-toc__part-number">2.</span>
                    <span class="series-toc__part-title">Day 2</span>
                </a>
            </li>
            <li class="series-toc__item">
                <a href="https://endjin.com/blog/fabcon-vienna-2025-day-3" class="series-toc__link">
                    <span class="series-toc__part-number">3.</span>
                    <span class="series-toc__part-title">Day 3</span>
                </a>
            </li>
    </ol>
</aside>]]></content:encoded>
    </item>
    <item>
      <title>Introduction to Technical Architecture</title>
      <description>&lt;p&gt;In this episode, Carmel delves into the realm of technical architecture, highlighting its importance in software development.&lt;/p&gt;
&lt;h2 id="what-youll-learn"&gt;What You'll Learn&lt;/h2&gt;
&lt;p&gt;Carmel breaks down the often-confused world of architecture types, demonstrating:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Enterprise vs Technical Architecture&lt;/strong&gt; - understanding the strategic versus implementation layers&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The dual nature of software architecture&lt;/strong&gt; - structure as a noun, vision as a verb&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Significant design decisions&lt;/strong&gt; - identifying choices that will be costly to reverse&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Technical leadership principles&lt;/strong&gt; - bridging business requirements and implementation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cross-cutting concerns&lt;/strong&gt; - managing the relationships between system components&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="key-insights-from-this-episode"&gt;Key Insights from This Episode&lt;/h2&gt;
&lt;p&gt;Discover why technical architecture goes beyond just drawing diagrams:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How architects decompose solutions into constituent parts whilst maintaining coherence&lt;/li&gt;
&lt;li&gt;Why platform choices, design patterns, and abstraction levels require careful consideration&lt;/li&gt;
&lt;li&gt;The critical role of architects in preventing project fragmentation&lt;/li&gt;
&lt;li&gt;Essential communication skills needed across stakeholder levels&lt;/li&gt;
&lt;li&gt;The balance between current needs and future scalability&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="who-should-watch-this"&gt;Who Should Watch This&lt;/h2&gt;
&lt;p&gt;Whether you're an aspiring architect, a senior developer looking to step up, or a project manager wanting to understand technical decision-making better, this video provides the foundational knowledge you need. Carmel's clear explanations make complex architectural concepts accessible whilst maintaining technical rigour.&lt;/p&gt;
&lt;h2 id="about-this-series"&gt;About This Series&lt;/h2&gt;
&lt;p&gt;This is the first episode in endjin's comprehensive Technical Architecture series. Over the coming weeks, we'll explore architectural drivers, patterns, anti-patterns, and real-world case studies from our extensive project portfolio. We help small teams achieve big things through considered architectural decisions.&lt;/p&gt;
&lt;h2 id="chapters"&gt;Chapters&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=3U3vCBaYEJY&amp;amp;t=0s"&gt;00:00&lt;/a&gt; Introduction to Technical Architecture&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=3U3vCBaYEJY&amp;amp;t=11s"&gt;00:11&lt;/a&gt; Types of Architecture&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=3U3vCBaYEJY&amp;amp;t=29s"&gt;00:29&lt;/a&gt; Focus on Technical Architecture&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=3U3vCBaYEJY&amp;amp;t=42s"&gt;00:42&lt;/a&gt; Simon Brown's Insights on Software Architecture&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=3U3vCBaYEJY&amp;amp;t=97s"&gt;01:37&lt;/a&gt; Role and Responsibilities of a Technical Architect&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=3U3vCBaYEJY&amp;amp;t=147s"&gt;02:27&lt;/a&gt; Conclusion and Next Steps&lt;/li&gt;
&lt;/ul&gt;</description>
      <link>https://endjin.com/what-we-think/talks/introduction-to-technical-architecture</link>
      <guid isPermaLink="true">https://endjin.com/what-we-think/talks/introduction-to-technical-architecture</guid>
      <pubDate>Mon, 18 Aug 2025 05:30:00 GMT</pubDate>
      <category>Technical Architecture</category>
      <category>Software Architecture</category>
      <category>Design Patterns</category>
      <category>ADRs</category>
      <enclosure length="0" type="image/jpeg" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/talks/introduction-to-technical-architecture.jpg" />
      <dc:creator>Carmel Eve</dc:creator>
      <content:encoded><![CDATA[<p>In this episode, Carmel delves into the realm of technical architecture, highlighting its importance in software development.</p>
<h2 id="what-youll-learn">What You'll Learn</h2>
<p>Carmel breaks down the often-confused world of architecture types, demonstrating:</p>
<ul>
<li><strong>Enterprise vs Technical Architecture</strong> - understanding the strategic versus implementation layers</li>
<li><strong>The dual nature of software architecture</strong> - structure as a noun, vision as a verb</li>
<li><strong>Significant design decisions</strong> - identifying choices that will be costly to reverse</li>
<li><strong>Technical leadership principles</strong> - bridging business requirements and implementation</li>
<li><strong>Cross-cutting concerns</strong> - managing the relationships between system components</li>
</ul>
<h2 id="key-insights-from-this-episode">Key Insights from This Episode</h2>
<p>Discover why technical architecture goes beyond just drawing diagrams:</p>
<ul>
<li>How architects decompose solutions into constituent parts whilst maintaining coherence</li>
<li>Why platform choices, design patterns, and abstraction levels require careful consideration</li>
<li>The critical role of architects in preventing project fragmentation</li>
<li>Essential communication skills needed across stakeholder levels</li>
<li>The balance between current needs and future scalability</li>
</ul>
<h2 id="who-should-watch-this">Who Should Watch This</h2>
<p>Whether you're an aspiring architect, a senior developer looking to step up, or a project manager wanting to understand technical decision-making better, this video provides the foundational knowledge you need. Carmel's clear explanations make complex architectural concepts accessible whilst maintaining technical rigour.</p>
<h2 id="about-this-series">About This Series</h2>
<p>This is the first episode in endjin's comprehensive Technical Architecture series. Over the coming weeks, we'll explore architectural drivers, patterns, anti-patterns, and real-world case studies from our extensive project portfolio. We help small teams achieve big things through considered architectural decisions.</p>
<h2 id="chapters">Chapters</h2>
<ul>
<li><a href="https://www.youtube.com/watch?v=3U3vCBaYEJY&amp;t=0s">00:00</a> Introduction to Technical Architecture</li>
<li><a href="https://www.youtube.com/watch?v=3U3vCBaYEJY&amp;t=11s">00:11</a> Types of Architecture</li>
<li><a href="https://www.youtube.com/watch?v=3U3vCBaYEJY&amp;t=29s">00:29</a> Focus on Technical Architecture</li>
<li><a href="https://www.youtube.com/watch?v=3U3vCBaYEJY&amp;t=42s">00:42</a> Simon Brown's Insights on Software Architecture</li>
<li><a href="https://www.youtube.com/watch?v=3U3vCBaYEJY&amp;t=97s">01:37</a> Role and Responsibilities of a Technical Architect</li>
<li><a href="https://www.youtube.com/watch?v=3U3vCBaYEJY&amp;t=147s">02:27</a> Conclusion and Next Steps</li>
</ul>
<p><a href="https://www.youtube.com/watch?v=3U3vCBaYEJY"><img src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/talks/introduction-to-technical-architecture.jpg"></a></p><p>Hello and welcome to a series on technical architecture. In today's video, we'll explore what these types of architecture really mean and why it's crucially important for successful software development.</p>
<p>Firstly, I should call out that there are different types of architecture. Enterprise architecture focuses on strategy rather than code: it is the strategy for how people, processes, and technology are utilized throughout an organization in order to be effective and efficient. The type of architecture I'll be focusing on in this video is technical architecture. This is a lower level of abstraction, which focuses on the technical detail and covers both software and data architecture.</p>
<p>In Simon Brown's book on software architecture, he states that software architecture has a dual meaning. Firstly, as a noun, it signifies structure. It is the result when solutions are decomposed into their constituent parts, including all cross-cutting concerns and relationships. As a verb, it translates to vision. It means to take the project drivers and turn them into a technical roadmap. And this concept equally applies to data architecture. This is the case for many of the core architecture principles that we'll cover in this series, and it is the reason for the umbrella term "technical architecture."</p>
<p>So technical architecture is about making the significant design decisions—the ones that will be costly to reverse if you change your mind. These can include, but aren't limited to: levels of abstraction, technology choices, platform choice, design approaches and patterns, and loads more.</p>
<p>In the same book, Simon Brown also states emphatically that it is the job of the architect to introduce technical leadership into a project. This means being the person who can see the bigger picture, whilst also understanding the technical implications of decisions. The architect has to bridge the gap between business requirements and technical implementation. They guide the team through complex trade-offs and ensure that the chosen direction aligns with both current needs and future scalability. They need to communicate effectively with stakeholders at all levels—from developers who need clear technical guidance, to project managers who need realistic timelines, and to business leaders who need to understand the value and risks of any architectural choice.</p>
<p>Without this technical leadership, projects can quickly become fragmented, and individual developers can make isolated decisions that don't contribute to a cohesive whole.</p>
<p>In this video, we have covered a brief introduction to technical architecture and the responsibilities of an architect in a project. In our next video, we'll deep dive into architectural drivers that shape projects, like functional requirements and constraints. These are the foundation that every architect has to understand in order to create an effective solution.</p>
<p>Thanks for listening.</p>]]></content:encoded>
    </item>
    <item>
      <title>Rx.NET Packaging Plan 2025</title>
      <description>&lt;p&gt;&lt;a href="https://endjin.com/who-we-are/our-people/ian-griffiths/"&gt;Ian Griffiths&lt;/a&gt; provides an update on the state of Rx.NET &lt;a href="https://endjin.com/what-we-think/talks/the-state-of-reactive-extensions-for-dotnet-in-2025"&gt;since our last talk&lt;/a&gt; in June. Ian addresses the current choices around packaging Rx, focuses on the &amp;quot;package bloat&amp;quot; problem and discusses the strides we've made in addressing this issue.&lt;/p&gt;
&lt;p&gt;Ian explores the new &amp;quot;&lt;a href="https://github.com/dotnet/reactive/tree/feature/rx-gauntlet/Rx.NET/Test/Gauntlet"&gt;Rx Gauntlet&lt;/a&gt;&amp;quot; test suite designed to validate proposed solutions and highlight the use of automated tests and Power BI reports to identify and solve packaging issues.&lt;/p&gt;
&lt;p&gt;He also compares two primary design options for future Rx versions, weighing their advantages and potential risks.&lt;/p&gt;
&lt;p&gt;Finally, he seeks community feedback to guide our next steps in releasing a stable Rx.NET v7 version.&lt;/p&gt;
&lt;p&gt;You can find the detailed write up of this topic via the GitHub Discussion: &lt;a href="https://github.com/dotnet/reactive/discussions/2211"&gt;Packaging Plans July 2025 #2211&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Contents:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=GSDspWHo0bo&amp;amp;t=0s"&gt;00:00&lt;/a&gt; - Introduction and Update&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=GSDspWHo0bo&amp;amp;t=12s"&gt;00:12&lt;/a&gt; - Addressing the Bloat Problem&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=GSDspWHo0bo&amp;amp;t=94s"&gt;01:34&lt;/a&gt; - Evidence-Based Approach and Prototypes&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=GSDspWHo0bo&amp;amp;t=129s"&gt;02:09&lt;/a&gt; - Introducing RX Gauntlet&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=GSDspWHo0bo&amp;amp;t=176s"&gt;02:56&lt;/a&gt; - Testing and Results&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=GSDspWHo0bo&amp;amp;t=414s"&gt;06:54&lt;/a&gt; - Power BI Report Insights&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=GSDspWHo0bo&amp;amp;t=624s"&gt;10:44&lt;/a&gt; - Design Options and Future Plans&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=GSDspWHo0bo&amp;amp;t=895s"&gt;14:55&lt;/a&gt; - Community Feedback and Next Steps&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This talk was given during endjin's internal weekly &amp;quot;show &amp;amp; tell&amp;quot; meeting, but has been edited for public consumption.&lt;/p&gt;</description>
      <link>https://endjin.com/what-we-think/talks/rxdotnet-packaging-plan-2025</link>
      <guid isPermaLink="true">https://endjin.com/what-we-think/talks/rxdotnet-packaging-plan-2025</guid>
      <pubDate>Mon, 04 Aug 2025 04:30:00 GMT</pubDate>
      <category>Reactive Extensions</category>
      <category>dotnet</category>
      <category>Rx.NET</category>
      <category>NuGet</category>
      <category>Reactive Programming</category>
      <category>ReactiveX</category>
      <category>C#</category>
      <category>Open Source</category>
      <category>.NET Foundation</category>
      <category>Talk</category>
      <enclosure length="0" type="image/jpeg" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/talks/rx-net-packaging-plan-july-2025.jpg" />
      <dc:creator>Ian Griffiths</dc:creator>
      <content:encoded><![CDATA[<p><a href="https://endjin.com/who-we-are/our-people/ian-griffiths/">Ian Griffiths</a> provides an update on the state of Rx.NET <a href="https://endjin.com/what-we-think/talks/the-state-of-reactive-extensions-for-dotnet-in-2025">since our last talk</a> in June. Ian addresses the current choices around packaging Rx, focuses on the "package bloat" problem and discusses the strides we've made in addressing this issue.</p>
<p>Ian explores the new "<a href="https://github.com/dotnet/reactive/tree/feature/rx-gauntlet/Rx.NET/Test/Gauntlet">Rx Gauntlet</a>" test suite designed to validate proposed solutions and highlight the use of automated tests and Power BI reports to identify and solve packaging issues.</p>
<p>He also compares two primary design options for future Rx versions, weighing their advantages and potential risks.</p>
<p>Finally, he seeks community feedback to guide our next steps in releasing a stable Rx.NET v7 version.</p>
<p>You can find the detailed write up of this topic via the GitHub Discussion: <a href="https://github.com/dotnet/reactive/discussions/2211">Packaging Plans July 2025 #2211</a></p>
<p>Contents:</p>
<ul>
<li><a href="https://www.youtube.com/watch?v=GSDspWHo0bo&amp;t=0s">00:00</a> - Introduction and Update</li>
<li><a href="https://www.youtube.com/watch?v=GSDspWHo0bo&amp;t=12s">00:12</a> - Addressing the Bloat Problem</li>
<li><a href="https://www.youtube.com/watch?v=GSDspWHo0bo&amp;t=94s">01:34</a> - Evidence-Based Approach and Prototypes</li>
<li><a href="https://www.youtube.com/watch?v=GSDspWHo0bo&amp;t=129s">02:09</a> - Introducing RX Gauntlet</li>
<li><a href="https://www.youtube.com/watch?v=GSDspWHo0bo&amp;t=176s">02:56</a> - Testing and Results</li>
<li><a href="https://www.youtube.com/watch?v=GSDspWHo0bo&amp;t=414s">06:54</a> - Power BI Report Insights</li>
<li><a href="https://www.youtube.com/watch?v=GSDspWHo0bo&amp;t=624s">10:44</a> - Design Options and Future Plans</li>
<li><a href="https://www.youtube.com/watch?v=GSDspWHo0bo&amp;t=895s">14:55</a> - Community Feedback and Next Steps</li>
</ul>
<p>This talk was given during endjin's internal weekly "show &amp; tell" meeting, but has been edited for public consumption.</p>
<p><a href="https://www.youtube.com/watch?v=GSDspWHo0bo"><img src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/talks/rx-net-packaging-plan-july-2025.jpg"></a></p><p>Back at the start of June, we put out a video about the state of Rx this year. This is kind of an update on what we've done since then.</p>
<p>One of the main areas of focus of that talk was the bloat problem, which is what we call the situation where you've written a .NET application that targets Windows, you are not actually using WPF or Windows Forms, but you get a copy of them anyway, because Rx sort of forces that on you if you're doing self-contained deployment.</p>
<p>That's a problem that has caused some projects to walk away from Rx, and we have been trying to work out how best to fix it without breaking everything for everyone else. I talked a bit about our plans last time, and one of the things I mentioned there was a plan to introduce a new way of doing tests, partly so we can avoid this sort of problem in the future, but also so that we can actually validate whether particular design approaches are going to work or not. Rather than just arguing about whether we'll do this or do that, we've taken a rather more practical, more evidence-driven approach, because we did actually announce right towards the end of last year that yes, we're moving forward with this plan. And then, helpfully, some people in the Rx community pointed out some flaws in that plan.</p>
<p>Fortunately we didn't go ahead and do that. Thank you, Rx community, for saving our necks there. And so we have moved in a more evidence-based direction. Essentially we've done a couple of things. We have gone ahead and built various prototypes implementing different attempts to solve this problem, and we've written some tooling to run automated tests against those solutions. One of the difficulties with these problems is that there are so many little variations you can test, and a lot of the problems only occur in certain, quite specific combinations of scenarios. Unless you automate the testing, it's incredibly easy to miss things.</p>
<p>I talked last time about this proposed test suite that I called Rx Gauntlet, and that now exists. It's in the Rx repo today, currently on a separate branch, the feature/rx-gauntlet branch, but it's in there and you can go and take a look at it. In the Rx.NET test folder there is a Gauntlet subfolder, which contains the source code for this suite of tests. The big difference between this and normal unit tests is that it can reach places unit tests can't reach: it can see problems that emerge not inside the code itself, but in the way that the packaging interacts with other things.</p>
<p>The thing we can do in these tests is generate new projects on the fly and change their configuration programmatically, so we can run hundreds of variations. In fact, one of the tests here generates, I think, 720 different variations on a theme, the basic theme being: I've written an application that uses two libraries, both of which are using Rx. There's a huge number of variations on that, some of which run into certain problems with certain designs for how to solve the bloat problem, and this just runs all of them. The idea is you come up with a possible solution, you run it through the gauntlet, and it will tell you how well it stands up.</p>
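<p>To make that combinatorial idea concrete, here is a minimal, purely illustrative sketch of the same technique: enumerate a matrix of packaging scenarios, generate a throwaway project for each one, publish it, and look for WPF or Windows Forms assemblies in the publish output. This is not the actual Rx Gauntlet code (that lives in the Rx repo); the package versions, target frameworks, and project settings below are just example values, and the sketch assumes a Windows machine with the .NET SDK installed.</p>
<pre><code># Illustrative only: enumerate packaging scenarios and check each published app
# for WPF/Windows Forms assemblies (the "bloat" symptom). Not the real Rx Gauntlet.
import itertools
import subprocess
import tempfile
from pathlib import Path

RX_VERSIONS = ["6.0.1"]  # example value; preview versions from the Rx preview feed could be added
TARGET_FRAMEWORKS = ["net8.0", "net8.0-windows10.0.19041.0"]
SELF_CONTAINED = [True, False]
BLOAT_MARKERS = {"PresentationFramework.dll", "System.Windows.Forms.dll"}


def dotnet(*args: str) -> None:
    """Run a dotnet CLI command and fail loudly if it does not succeed."""
    subprocess.run(("dotnet",) + args, check=True, capture_output=True)


def is_bloated(rx_version: str, tfm: str, self_contained: bool) -> bool:
    """Generate a console app that uses Rx, publish it, and report whether any
    desktop-framework assemblies ended up in the publish folder."""
    with tempfile.TemporaryDirectory() as tmp:
        app = str(Path(tmp) / "app")
        dotnet("new", "console", "-o", app)
        dotnet("add", app, "package", "System.Reactive", "--version", rx_version)
        # Overriding TargetFramework on the command line lets the one generated
        # project be restored and published against each TFM in the matrix.
        dotnet("publish", app, "-c", "Release", "-r", "win-x64",
               "--self-contained", str(self_contained).lower(),
               "-p:TargetFramework=" + tfm)
        publish_dir = Path(app) / "bin" / "Release" / tfm / "win-x64" / "publish"
        deployed = {f.name for f in publish_dir.glob("*.dll")}
        return bool(BLOAT_MARKERS.intersection(deployed))


if __name__ == "__main__":
    for rx, tfm, sc in itertools.product(RX_VERSIONS, TARGET_FRAMEWORKS, SELF_CONTAINED):
        print(rx, tfm, sc, "bloated" if is_bloated(rx, tfm, sc) else "ok")
</code></pre>
<p>The real Gauntlet covers far more dimensions and scenarios than this, but the shape of the idea is the same: build a candidate, run it through the matrix of scenarios, and let the tooling find the failing combinations.</p>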
<p>We're going for this evidence-based approach where we actually build out a proposed solution, run it through the gauntlet, and see what happens. You can find the actual examples: we've got several branches in progress with various different variations on the packaging. I'm not going to go through them now, but they are described in the announcement. There's a discussion up on the repo saying what we've done; it tells you about the test suite and about the various versions we have prototyped and made available on the public Rx preview package feed, which is hosted on Azure DevOps.</p>
<p>By the way, if you go here, it's possible that you will get an access denied error. That happens if you are logged in to Azure DevOps. What happens is Azure DevOps goes: oh, you're logged in, I'll try to show you the version of this page that gives you access to all the things that maintainers of the project can see. And then it goes: but you are not a maintainer of this project, therefore you can't have this. However, if you're not logged in, it shows it to you just fine. Which is kind of odd, but there we are. So if you can't see this, try opening a private browser window and then you should be able to see it.</p>
<p>If we look at some of the packages in here, you can see this is the System.Reactive package. There are a bunch of 7.0-preview-something packages, and these are all trying out variations on how we might solve the packaging problems; exactly what each of them means is described in the discussion documents, so I'm not going to go into that now.</p>
<p>In some cases there are also WPF-specific packages and Windows Forms-specific ones, and depending on which design option we go for, there might also be a new System.Reactive.NET package. That would become the new main face of Rx, with System.Reactive relegated to being a legacy facade. There are two design options, one where that does happen and one where it doesn't, and I'll come back and talk about that in just a minute.</p>
<p>So this gauntlet exists, and what it produces is enormous quantities of JSON saying what it found for each of the test cases that you ran. We basically spit out a record saying: okay, this one we were trying to build for a target of .NET 8, and in this particular one we turned on the EmitDisabledTransitiveProjectReferences flag; basically all the settings that went into it. And then we say: did we actually see a copy of WPF or Windows Forms deployed into the output of the app? So we're testing for bloat in this particular case, though actually there are several different things we test for here.</p>
<p>Gauntlet tests for bloat with direct usage and through transitive references. It also checks for a potential bug that can occur with extension methods if you get certain things wrong in the design. In short, it tests for all the things that we know might go wrong with certain kinds of designs. You end up with lots and lots of JSON to pore through, which is not particularly easy to find things in, so actually finding the results that tell you something is wrong is a bit like looking for a needle in a haystack. So we have also produced a Power BI report that sits on top of the data and basically pulls out the things that have problems for you.</p>
<p>This page here tries to find all of the versions that suffer from package bloat. If I narrow this down, you can see that Rx versions 5 and 6.0.1, which are published versions of Rx that are up there today, have the bloat problem. We didn't have it before Rx 5, because it only started to occur when self-contained deployment became a possibility and certain versions of WPF and Windows Forms shipped.</p>
<p>And you can see from this that all of the preview versions that we're trying don't suffer from bloat if you just use Rx directly from your application. That's what this page is showing us: it has gone through all the results and is just showing us the ones that fail. This shows which things have the plugin bug, which is actually an older bug that was reported back in the Rx 2 days, was fixed in Rx 3.1, and then had that fix reverted. Rx 4 didn't actually cause a problem at the time, but all subsequent versions have suffered from this bug. Our current plan is to not fix this, because no one has ever actually complained about it and we think it probably doesn't matter as much now as it used to, because the scenarios in which it does occur are much narrower than they used to be. But the report does look for it, so if people think it is a problem, we can still use this to find it. At the moment, there isn't really a good way to solve this.</p>
<p>This looks for the extension method bug, which is the thing our former proposed workaround ran into. We thought we had a workaround for preventing bloat with the existing published version of Rx, but it turns out you get build errors with certain extension methods in certain situations, and this basically looks for that. And again, this is showing that the problem exists: if you try to disable transitive framework references in Rx version 5 or version 6, it doesn't work. But with all the candidate builds we have for fixing the problems, it's okay; you don't get these errors, because we have worked out how to work around them.</p>
<p>Now, this Power BI report wasn't the whole story. We also found that the transitive reference situation was all rather complicated, so it was helpful to write a notebook to do some more controlled processing. That's more to do with the fact that I'm better at notebooks than I am at Power BI; you probably could do it with Power BI, but a notebook was the quickest way to get there. This ultimately gives you a top-level view of how the various options stack up against some of the more subtle problems that can occur when your application may or may not be using Rx directly, but is using it indirectly through other components, so you can't necessarily control exactly how those things are using it. I'm not going to go into the details here, because we've published this report and you can go and read it if you are interested, but basically you get a color chart showing you that some options work better than others.</p>
<p>That one down the bottom is one of the two design candidates we're still thinking of doing. You can see it's green on almost everything, with one rather significant red that I'm going to talk about in a moment. The other one we are contemplating is actually the second row in here, although that's got more yellows in it and does have one red there, which turns out to be more of an artifact of how this notebook works. So it's got more yellows than greens, but it doesn't suffer from the big major problem down here. Let's talk about that.</p>
<p>So if I go back to the webpage that has the announcement of what we're doing: we've got four fully worked-out prototypes, but we've essentially boiled it down to two that we're choosing between. One of those retains System.Reactive as the main Rx assembly. It's the main Rx package today, and we want to keep it as the main Rx package in the future if possible, because that's the least disruptive thing to do; one of these solutions does do that, and the other one doesn't. So why are we contemplating one that doesn't? Well, there's a serious issue with what we've had to do to make it work. If we want to retain System.Reactive as the main package, we basically have to play tricks with how the package is created. What we end up doing is creating reference assemblies that deliberately leave out certain problematic bits of the public API. That public API is still present in the runtime assemblies, so you get binary compatibility, because everything is there at runtime. But some of the problems that we used to have with the workarounds for this go away, because the compiler uses the reference assemblies and so can't see those problematic bits.</p>
<p>This seems to work, but we're kind of unhappy with it, because it's a clever trick, and the history of applying clever tricks to try and work around packaging problems in Rx does not look good. Basically, every time we do it, it turns out we're not as clever as we think. Or, more often, what happens is that things change, and stuff that was just fine when it was introduced goes on to cause problems as a result of other things going on in the ecosystem. So we actually have a preference for keeping things simple. Ideally we would do nothing weird, nothing unusual: we would build Rx in a completely conventional way, and this would minimize the chances of things going odd in the future. But the only way we seem to be able to do that is by introducing a new main package for Rx and turning the existing one into a legacy facade. That's the only way we can really get a clean, ordinary-looking build for the main Rx package.</p>
<p>And so that option, introducing a new main Rx package, has the attraction of being less likely to be a liability in the long run. That's why we are considering it as the other design option: it's less likely to break for reasons that aren't even visible yet but that could come back to haunt us in the future. That has happened several times before with Rx, and we don't want to be the latest in that line.</p>
<p>So that's where we're at. We've got these two options and actually there's sort of a fudge option, which is that we do one and then the other. One possibility is that for Rx 7 we do the approach where we don't introduce a new main package. We just keep System.Reactive as the primary package for Rx as it is today. But we use this hack to deal with some of these packaging problems and if that works out fine, then great. And if it turns out to cause problems, then maybe in Rx 8, we could do the thing where we introduce a new main assembly and we relegate System.Reactive to now being just a legacy facade that's there purely for backwards compatibility purposes. We could conceivably do that in two steps. So the debate at this point now really is do we actually just go straight for the solution where we just don't mess around with the main package, or do we see if we can get away with this clever trick on the System.Reactive package so that we minimize the disruption, understanding that we might then discover that actually doesn't work after all, and we have to do the package split in the long run anyway.</p>
<p>So those are the choices before us. Or maybe someone has other ideas. We've got a place to go and vote on the discussion, and if you think there's something we've missed, you can vote for the "none of the above" option there. This is not a democracy, but it will tell us how people are feeling and we'll take it into account. We haven't decided yet, but we'd like to decide soon, because we would really like to get at least a beta release of Rx 7 out there, and then a proper release as soon as possible, so we can finally fix this and get back to doing other things to make Rx better rather than just trying to fix build issues.</p>
<p>So there we are.</p>]]></content:encoded>
    </item>
    <item>
      <title>Medallion Architecture in Practice</title>
      <description>&lt;p&gt;In this episode, Carmel Eve delves deeper into the connection between the medallion architecture and the semantic layer, exploring when data becomes truly production ready.&lt;/p&gt;
&lt;p&gt;Carmel recaps the three tiers of the medallion architecture: raw bronze, cleaned silver, and opinionated gold layers. The focus is placed on the semantic layer, which adds context and meaning to the gold layer data, making it useful for humans, machines, and AI. This semantic layer includes human or machine-friendly names, advanced data types, metadata for governance, and relationships between data objects.&lt;/p&gt;
&lt;p&gt;Carmel also discusses the environments for software development—development, testing, and production—and how this multi-environment system applies to the medallion architecture. Finally, we highlight the advantages of using this pattern, such as supporting multiple use cases, enhancing data lineage, and ensuring data's flexibility and security.&lt;/p&gt;
&lt;h2 id="chapters"&gt;Chapters&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=RHdGb0ZkI7c&amp;amp;t=0s"&gt;00:00&lt;/a&gt; Introduction and Recap&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=RHdGb0ZkI7c&amp;amp;t=6s"&gt;00:06&lt;/a&gt; Understanding the Semantic Layer&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=RHdGb0ZkI7c&amp;amp;t=81s"&gt;01:21&lt;/a&gt; Defining the Semantic Layer&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=RHdGb0ZkI7c&amp;amp;t=182s"&gt;03:02&lt;/a&gt; Data Production Readiness&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=RHdGb0ZkI7c&amp;amp;t=193s"&gt;03:13&lt;/a&gt; Development, Testing, and Production Environments&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=RHdGb0ZkI7c&amp;amp;t=283s"&gt;04:43&lt;/a&gt; Applying the Medallion Architecture&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=RHdGb0ZkI7c&amp;amp;t=570s"&gt;09:30&lt;/a&gt; Conclusion and Key Takeaways&lt;/li&gt;
&lt;/ul&gt;</description>
      <link>https://endjin.com/what-we-think/talks/medallion-architecture-in-practice</link>
      <guid isPermaLink="true">https://endjin.com/what-we-think/talks/medallion-architecture-in-practice</guid>
      <pubDate>Tue, 29 Jul 2025 05:30:00 GMT</pubDate>
      <category>Data Engineering</category>
      <category>Medallion Architecture</category>
      <category>Semantic Layer</category>
      <category>Microsoft Fabric</category>
      <category>Databricks</category>
      <category>Data Lake</category>
      <category>Data Lakehouse</category>
      <category>Gold Layer</category>
      <category>Bronze Layer</category>
      <category>Silver Layer</category>
      <enclosure length="0" type="image/jpeg" url="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/talks/medallion-architecture-in-practice.jpg" />
      <dc:creator>Carmel Eve</dc:creator>
      <content:encoded><![CDATA[<p>In this episode, Carmel Eve delves deeper into the connection between the medallion architecture and the semantic layer, exploring when data becomes truly production ready.</p>
<p>Carmel recaps the three tiers of the medallion architecture: raw bronze, cleaned silver, and opinionated gold layers. The focus is placed on the semantic layer, which adds context and meaning to the gold layer data, making it useful for humans, machines, and AI. This semantic layer includes human or machine-friendly names, advanced data types, metadata for governance, and relationships between data objects.</p>
<p>Carmel also discusses the environments for software development—development, testing, and production—and how this multi-environment system applies to the medallion architecture. Finally, we highlight the advantages of using this pattern, such as supporting multiple use cases, enhancing data lineage, and ensuring data's flexibility and security.</p>
<h2 id="chapters">Chapters</h2>
<ul>
<li><a href="https://www.youtube.com/watch?v=RHdGb0ZkI7c&amp;t=0s">00:00</a> Introduction and Recap</li>
<li><a href="https://www.youtube.com/watch?v=RHdGb0ZkI7c&amp;t=6s">00:06</a> Understanding the Semantic Layer</li>
<li><a href="https://www.youtube.com/watch?v=RHdGb0ZkI7c&amp;t=81s">01:21</a> Defining the Semantic Layer</li>
<li><a href="https://www.youtube.com/watch?v=RHdGb0ZkI7c&amp;t=182s">03:02</a> Data Production Readiness</li>
<li><a href="https://www.youtube.com/watch?v=RHdGb0ZkI7c&amp;t=193s">03:13</a> Development, Testing, and Production Environments</li>
<li><a href="https://www.youtube.com/watch?v=RHdGb0ZkI7c&amp;t=283s">04:43</a> Applying the Medallion Architecture</li>
<li><a href="https://www.youtube.com/watch?v=RHdGb0ZkI7c&amp;t=570s">09:30</a> Conclusion and Key Takeaways</li>
</ul>
<p><a href="https://www.youtube.com/watch?v=RHdGb0ZkI7c"><img src="https://res.cloudinary.com/endjin/image/upload/f_auto/q_80/assets/images/talks/medallion-architecture-in-practice.jpg"></a></p><p>Welcome back. In our previous video, we covered an introduction to the medallion architecture. Now let's understand how it connects to the semantic layer, plus we'll discuss when data becomes truly production-ready.</p>
<p>As we saw last time, after passing through the three tiers of the medallion architecture, we arrived at an opinionated or projected view of the data in our gold layer. This gold layer forms the basis of a semantic layer, but the semantic layer adds additional context and meaning to the tables in the gold layer, which enables humans and, more so these days, machines and AI to understand and engage with data.</p>
<p>So this could include human or machine-friendly names, descriptions, and synonyms for objects like tables and columns. It could include assigning more advanced data types over the primitive types used in the gold layer – saying, for example, that this number is a percentage, a latitude, or a currency, or that this string is a city. It could include adding metadata that helps enable effective governance of the gold layer, like data classification tags. And it could mean defining relationships between your objects.</p>
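<p>Purely as an illustration of the kind of information this adds (the names, classifications, and relationships below are invented, and in practice this definition would live in a tool such as Power BI, Purview, or Unity Catalog rather than in hand-rolled code), a semantic description of a single gold table might capture something like this:</p>
<pre><code># Illustrative sketch: the extra meaning a semantic layer attaches to a gold table.
from dataclasses import dataclass, field


@dataclass
class SemanticColumn:
    physical_name: str               # column name as it appears in the gold table
    friendly_name: str               # human/AI-friendly display name
    semantic_type: str               # richer meaning than the primitive storage type
    classification: str = "public"   # governance tag, e.g. "public" or "confidential"


@dataclass
class SemanticTable:
    physical_name: str
    friendly_name: str
    description: str
    columns: list[SemanticColumn] = field(default_factory=list)
    relationships: list[tuple[str, str]] = field(default_factory=list)  # (local column, "table.column")


monthly_revenue = SemanticTable(
    physical_name="gold.monthly_revenue",
    friendly_name="Monthly Revenue",
    description="Total order revenue per calendar month.",
    columns=[
        SemanticColumn("month", "Month", "calendar month"),
        SemanticColumn("total_revenue", "Total Revenue", "currency (GBP)", "confidential"),
        SemanticColumn("order_count", "Order Count", "count"),
    ],
    relationships=[("month", "gold.date_dimension.month")],
)
</code></pre>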
<p>Often the idea of the gold layer is conflated with the semantic layer, but without defining your objects' meaning and relationships, you can't use that data in any meaningful way. The semantic layer usually sits outside of the lakehouse, and its exact form will be dependent on your use case.</p>
<p>A good example of this is Power BI. So when you're creating a report, you import the data from the gold layer, and this forms a part of the model. But you then update column names and types to make the report more readable, and enrich the data, and use the modeling tab to define the relationships between your tables and objects. And only at this point is your semantic model fully defined.</p>
<p>Power BI is a useful example because the Semantic Model is a built-in concept, but your semantic layer could also be defined using other tools like Microsoft Purview or Databricks Unity Catalog. Whatever your use case – be it reporting, analytics, or application development – you're likely to define a semantic layer in order to describe your data and give it meaning.</p>
<p>Each output from the semantic layer is a data product – a valuable asset that should be maintained, versioned, and treated as a fully contained product. You may have multiple data products or versions of each, which are consumed by different use cases or used for different types of analytics.</p>
<p>So overall, we have: data is ingested from on-premises or cloud systems into the raw bronze layer, is then processed into the cleaned and validated silver layer, and is projected into the opinionated gold layer. Data is then defined as it moves into the described semantic layer, where it goes on to serve actionable insights. The bronze, silver, and gold layers usually sit inside the lakehouse, but the semantic layer often sits outside.</p>
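<p>As a concrete (and deliberately simplified) sketch of that flow, here is roughly what the promotion from bronze to silver to gold might look like in PySpark. The table names, columns, and transformations are invented for the example, and it assumes Delta-backed schemas named bronze, silver, and gold already exist, as they typically would in a Fabric or Databricks lakehouse:</p>
<pre><code># Illustrative PySpark sketch of promoting data through the medallion tiers.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-example").getOrCreate()

# Bronze: land the raw data exactly as it arrived, with no cleaning applied.
raw = spark.read.json("Files/landing/orders/2026-05-01/")
raw.write.format("delta").mode("append").saveAsTable("bronze.orders_raw")

# Silver: clean and validate, but stay unopinionated about downstream use.
silver = (
    spark.table("bronze.orders_raw")
    .dropDuplicates(["order_id"])
    .filter(F.col("order_id").isNotNull())
    .withColumn("order_date", F.to_date("order_date"))
    .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
)
silver.write.format("delta").mode("overwrite").saveAsTable("silver.orders")

# Gold: project an opinionated, use-case-specific view (here, monthly revenue).
gold = (
    spark.table("silver.orders")
    .groupBy(F.date_trunc("month", "order_date").alias("month"))
    .agg(
        F.sum("amount").alias("total_revenue"),
        F.countDistinct("order_id").alias("order_count"),
    )
)
gold.write.format("delta").mode("overwrite").saveAsTable("gold.monthly_revenue")
</code></pre>
<p>The point is not the specific transformations but the shape: bronze keeps the data exactly as it landed, silver is cleaned and validated but still unopinionated, and gold is a projection shaped for a particular use case.</p>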
<p>So now on to our next question: when is data production-ready?</p>
<p>In software development, we often talk about different environments, and at endjin we often use a three-tiered approach with development, testing, and production.</p>
<p>So in general, the development environment is where engineers are currently working and is therefore the most volatile. New code will be deployed here, and bugs will often be found during this first stage. Things will be changing rapidly, and if the project is actively being worked on, extremely rapidly. Nothing production-focused or client-facing should ever depend on the development environment, as it's purely a place for developers to make changes, trial solutions, and update things.</p>
<p>Once developers are happy with the changes made, and hopefully those changes have passed some kind of quality gate, they'll be deployed into the testing environment. Here, the code undergoes more rigorous testing in a more controlled environment. There might be additional tests, including integration tests, non-functional tests, and anything else that doesn't fit into the quick feedback loop needed during development. Hopefully at this point, any bugs that slipped through the development stage will be found.</p>
<p>Finally, once all the tests have passed and all the validation has been carried out, the code will be deployed into production. This is the live code that your wider solution depends on, and if, for example, you're hosting a client-facing web application, this code will drive your public app, and you would therefore hope that it is reliable and bug-free.</p>
<p>This is a generalized pattern that's applicable in loads of scenarios, but the number of environments and their purposes can vary. For example, you might require a pre-production environment for additional validation, or a specialized QA environment in order to meet regulatory requirements.</p>
<p>So how does this apply to our data design pattern? It's slightly confusing that we now have three tiers in both of these separate but related dimensions. In the medallion architecture, data moves from raw to clean to projected. But alongside this, we still want an environment in which engineers can experiment and change things about data – including the silver and gold tier – without this impacting anything public-facing (our development environment). We also want an environment in which changes can be validated (testing), and an environment in which we can be as sure as possible that data is reliable and can be depended upon (production).</p>
<p>So as such, we can design a system as follows: in each of our environments, we have a bronze, silver, and gold tier. It's clear that we don't want any production systems relying on our bronze raw data, as any new data that arrives needs to be cleaned and validated before it can be used. But as is the case for software development, anything that is end-user facing should be relying on the production environment.</p>
<p>The development environment is where engineers are currently working and is therefore volatile. These engineers will need to update things relating to the silver and gold tier and will need to do so without worrying about affecting these production systems. There could be necessary schema changes, column renames, bugs introduced accidentally in calculated columns, tables accidentally deleted, and much more.</p>
<p>It is worth noting that though the environment is functionally volatile, the data itself may be more tightly controlled. Data needs to be consistent in order to allow for development and testing, and test data might be created specifically to hit edge cases, or fake data used to restrict access to production data. Any changes that are made in the development environment are then validated in the testing environment, and once those quality gates have been passed, they're deployed in production.</p>
<p>And within the production environment, nothing should be depending on the data in the bronze tier, as this data is unvalidated and raw. But once we've cleaned and validated the data, both silver and gold tier should be production-ready. At this point, the data's cleaned, validated, and not subject to unpredictable changes as it is in the development and, to some extent, testing environment.</p>
<p>As I mentioned in my last video, it's often stated that data quality improves as you go through the medallion architecture. However, this statement is flawed. The data in the silver tier is no less production-ready than the gold tier. It is just an unopinionated representation of your data. It might be used to feed machine learning models and data science experiments, and the results of those may well be client-facing. Therefore, data in your silver tier in the production environment is very much production data and should be viewed as such.</p>
<p>So to answer our earlier question: when is data production-ready? When you're in your production environment, data is production-ready when it is in the silver or gold tier.</p>
<p>Overall, the opinionation of our data increases as it moves through the medallion architecture. We start off with data in its completely unaltered raw form, and we end with data that is first cleaned and then structured for a specific use case.</p>
<p>Using medallion architecture, we can:</p>
<ul>
<li>Land and then work with data in any format, whether it is structured, semi-structured, or unstructured</li>
<li>Impose gates that limit data quality issues</li>
<li>Support multiple use cases</li>
<li>Support different workload types, including reporting, machine learning, and data science</li>
<li>Support recreation of tables at any time</li>
<li>Support auditing and data versioning</li>
<li>Allow for historical playback, as you have each version of the raw data saved</li>
<li>Allow for greater agility where, for example, customer change requests can be dealt with by just updating the gold projection</li>
<li>Define data lineage and how data moves from source through processing into consumption</li>
<li>Allow for flexibility and security, with certain groups being given access to different projections with different levels of sensitivity</li>
</ul>
<p>An important point is that there are no hard and fast rules for implementing this pattern. Though the medallion architecture provides us with useful guidance on how to structure our data solutions, there are a lot of nuanced decisions that need to be made: for example, where exactly to draw the lines between the different layers, how much validation is done at each stage, and how much processing is done in the gold storage layer versus in Power BI. Each of these requires balancing performance, flexibility, data copying and storage costs, data volumes, historical data support, security, regulatory requirements, and loads more.</p>
<p>Overall, the combination of this data design pattern and a multi-environment system in which data reliability increases as it's promoted through the environments provides a reliable and flexible architecture that can support many different scenarios.</p>
<p>That wraps up our look at the medallion architecture. We've covered the three tiers, the semantic layer, multi-environment deployment, and key advantages of this approach.</p>
<p>Thanks for watching. I hope you found this valuable.</p>]]></content:encoded>
    </item>
  </channel>
</rss>