January 29, 2026

Improving Frontend Regression Testing with Chromatic

January 29, 2026

After recently migrating our frontend to Remix, we took the opportunity to reassess how we approach frontend testing, particularly regression testing. While we already had unit test coverage, we identified a gap when it came to validating UI changes. This is where Chromatic became a part of our frontend testing strategy. This post outlines why we introduced Chromatic and how it fits into a Remix-based workflow.

Even when application functionality remains unchanged, subtle visual regressions can still be introduced. Changes to spacing, typography, layout, or component states can easily slip through without being caught by traditional tests.

What we needed was a way to automatically detect meaningful UI changes while still fitting into our existing development workflow. At the same time, it was important to avoid introducing a fragile or high-maintenance testing setup, one that adds overhead without delivering proportional benefit. Our implementation with Chromatic attempts to balance automation, reliability, and developer experience as a practical addition rather than an extra burden.

Why Chromatic?

Chromatic provides visual regression testing on top of Storybook. Instead of testing components purely through assertions, Chromatic renders components in a real browser environment and captures screenshots. These are then compared against a known baseline to highlight visual changes.

The key reasons we chose Chromatic were:

Automated visual diffs that are easy to review
Integration with Storybook, which we already use for component development
CI-friendly workflow that fits well into pull requests

Chromatic offers two closely related features for reviewing UI changes: UI Review and Visual Tests. While they have some overlap of functionality, they serve different purposes and are designed for different levels of enforcement. UI Review is enabled by default, whereas Visual Tests are an optional feature.

UI Review generates snapshots that highlight differences against a baseline, with a focus on collaboration and feedback. A key characteristic of UI Review is its flexibility: anyone including the author can approve the changes. The approval is also persistent, once a UI Review is approved, further commits to the same branch do not require re-approval.

Visual Tests also generate snapshots and compare them against an approved baseline, but they treat those comparisons as authoritative. Visual Tests can be configured to run across multiple browsers or viewports. Unlike UI Review, any subsequent changes committed to a branch will require re-approval. Approval permissions can also be restricted to specific roles. Visual Tests incur additional cost, and testing across multiple browsers or viewports increase the number of snapshots generated.

Storybook as the Foundation

Chromatic works best when components are well-represented in Storybook. As part of our migration to Remix, we invested time in creating stories for our components.

Defining common UI states
Mocking Remix loaders and actions where needed

Chromatic simply builds on top of this foundation by continuously validating those stories.

Automatic API Mocking with OpenAPI

To keep our Storybook components realistic without manual effort, we built an automated pipeline that generates stories from our OpenAPI schema.

We use the swagger auto schema to document what each api endpoint returns, including example responses. A script then parses this backend schema alongside the remix route definitions to generate complete stories, including MSW handlers with example responses that exactly match real API contracts. The handlers are created for the full route hierarchy needed for a component to be rendered.

This means Storybook stories always stay in sync with the backend, and there is only a single source of truth to maintain, which is the OpenApi schema.

We use the below to parse the source code for a component to find imports from the API client.

  
    const inspectModule = (filename: string) => {
  const code = readFileSync(`./app/$`, 'utf-8')
  const ast = parse(code, { jsx: true })
  const requiredMocks = ast.body
    .filter(
      (node) =>
        node.type === 'ImportDeclaration' &&
        [ '~/api/Api'].includes(node.source.value)
    )
    .map((node) =>
      (node as unknown as { specifiers: [{ imported: { name: string } }] }).specifiers.map(
        (specifier) => operations[specifier.imported.name]
      )
    )
    .reduce((acc, val) => [...acc, ...val], [])
    .filter(Boolean)

  

And then we use the below to build the response for the mocks

  
    const buildExample = (schema: OpenAPIV3.SchemaObject): Example | undefined => {
  // recursively build an example from the examples given in the schema.
  if (typeof schema.example !== 'undefined') return schema.example as Example
  if (schema.type === 'array') {
    const example = buildExample(schema.items as OpenAPIV3.SchemaObject)
    if (typeof example === 'object' && example !== null && !Object.keys(example).length) return []
    return example ? [example] : []
  }
  if (schema.type === 'object') {
    return Object.entries(schema.properties ?? {}).reduce((result, [key, value]) => {
      return { ...result, [key]: buildExample(value as OpenAPIV3.SchemaObject) }
    }, 
  
  return undefined
}

  

How Chromatic Fits into the Workflow

Once set up, the Chromatic workflow is straightforward:

A developer opens a pull request
CI builds the Storybook and uploads it to Chromatic
Chromatic runs visual comparisons against the baseline
Any detected UI changes are surfaced directly in the PR (as below)

From there, snapshots before and after with the changes highlighted can be viewed in storybook

The changes need to be Approved in Chromatic which makes visual review explicit and intentional.

Lessons Learned So Far

Good Storybook coverage is essential. Chromatic relies on developers creating and maintaining stories for newly developed components, and Storybook itself needs ongoing care to remain a useful and accurate representation of the UI.
Entire pages can be snapshot tested as a single component, but doing so requires a fair amount of mocking (Which we have automated)
Baseline discipline is important and accepting visual changes should be a deliberate action.
Chromatic is most effective when treated as part of a broader testing strategy, rather than a silver bullet.
Finally, how restrictive or unobtrusive Chromatic feels in day-to-day development depends largely on how it is configured. Decisions such as whether to enable Visual Tests, and how approvals are gated all influence both the cost of the tool and its impact on developer workflow.

Final Thoughts

As our Remix application continues to evolve, Chromatic gives us confidence that UI changes are intentional and understood, without relying solely on manual checks.

Migrating to Remix was a natural point to rethink how we test our frontend. By adding Chromatic to our toolchain, we’re reducing an important gap in regression testing, one that traditional tests struggle to cover effectively. Visual regression testing doesn’t remove the need for thoughtful development or careful review, but it does make those processes more reliable and scalable.

December 8, 2025

Victor Wenas

Patterns & Best Practices in Event-Driven Systems

December 8, 2025

Victor Wenas

Designing Robust, Scalable, Maintainable Event Architectures

Event-driven architecture (EDA) gives teams the ability to build decoupled, scalable systems that evolve independently. In the previous article, we introduced the idea using a restaurant analogy: instead of shouting instructions across the kitchen, teams place “dockets” on the rail and stations take what they need.

We’ll continue that analogy lightly in this post—sprinkling it here and there—while focusing on the engineering patterns that make event-driven systems work in practice.

Core Patterns in Event-Driven Architecture

Pattern 1: Event Notification

An event notification is a tiny message that simply declares “something happened.”

It doesn’t contain all the details—just enough for downstream systems to react. Think of it like a kitchen bell dinging: The bell doesn’t contain the meal. It’s just a signal. The cook still needs to check the ticket rail (the database) for the details of the order 12345. Example

{
  "eventName": "OrderCreated",
  "orderId": 12345,
  "createdAt": "2025-11-26T01:00:00Z"
}

Why it’s useful
- Extremely lightweight
- Easy to publish, easy to fan out
- Consumers decide how much extra data they need
Trade-offs
- Consumers must fetch details themselves
- More cross-service calls → more coupling
- Higher latency when many consumers query upstream systems

Use this pattern when the event is a simple trigger—like a bell, not a full meal.

Pattern 2: Event-Carried State Transfer (ECST)

In Event-Carried State Transfer, the event carries all required data so consumers don’t need to make additional calls.

It’s the equivalent of the chef not only ringing the bell but also placing the complete plated dish on the pass. No one needs to ask questions—everything needed is right there.

{
  "eventName": "OrderPacked",
  "orderId": 12345,
  "items": [
    { "sku": "ABC123", "qty": 2 }
  ],
  "warehouseId": 19,
  "totalWeightGrams": 1850
}

Why it’s powerful
- Zero need for back-calls → full decoupling
- Highly resilient—consumers can process events even if upstream is down
- Faster pipelines, fewer moving parts
Trade-offs
- Larger event payloads → more bandwidth/storage
- More careful schema management
- Potential latency/throughput impact in high-volume streams

You’re essentially pre-plating the data, which costs more effort upfront, but saves everyone time downstream.

Pattern 3: Event Sourcing

Instead of storing only the current state, Event Sourcing stores every change as an immutable event.

State is rebuilt by replaying the events.

Just like a kitchen’s order history tells the complete story of what happened throughout service, event sourcing gives you a full timeline of every change.

Example (C# Aggregate Rehydration)

var events = eventStore.LoadStream("Order-12345");
var order = OrderAggregate.Rehydrate(events);

Why it’s valuable
- Perfect audit trail
- Time-travel debugging
- Ability to replay events for recovery or analytics
Trade-offs
- Higher cognitive load for newcomers
- Requires rigorous versioning
- Requires maintaining projections/read models

CQRS Note

Event Sourcing often pairs with CQRS—splitting commands (writes) from queries (reads).

It’s like chefs cooking in the kitchen while waitstaff maintain menus, tables, and customer-facing views.

Each side does what it’s optimized for.

Pattern 4: Choreography (Decentralised Workflow)

With choreography, services react to each other’s events without a central coordinator.

It’s like a well-trained kitchen crew: when the grill station finishes cooking a steak, the garnish station knows it’s their turn—without anyone shouting instructions.

Benefits
- Fully decoupled
- Naturally scalable
- Easy for new services to join by subscribing
Drawbacks
- Harder to visualize the full workflow
- Risk of event spaghetti
- Difficult to enforce global ordering or handle cross-service failures

Great for simple flows where each station knows what to do next.

Pattern 5: Orchestration (Service Composer / Workflow Engine)

Orchestration introduces a conductor—a central service that coordinates each step of the workflow.

Think of it like a head chef calling out the steps during a complex dish:

“Start the sauce.”

“Grill the chicken.”

“Plate it.”

The orchestration engine takes responsibility for the ordering and coordination.

public class DispatchOrchestrator 
{
    public async Task Handle(OrderPaid evt)
        => await Send(new ReserveStock(evt.OrderId));

    public async Task Handle(StockReserved evt)
        => await Send(new BookShipment(evt.OrderId));

    public async Task Handle(ShipmentBooked evt)
        => await Send(new MarkOrderReady(evt.OrderId));
}

When orchestration is ideal
- Multi-step workflows
- Processes requiring retries and compensation
- Compliance requirements → clear traceability

Choreography scales. Orchestration brings order to complexity. Many systems end up using both.

Best Practices for Event-Driven Systems

Idempotency Everywhere

Events may be delivered more than once.

Consumers must behave safely even if they “see the same order twice.”

Just like the kitchen must avoid making the same dish twice if the order docket is accidentally duplicated.

if (db.HasProcessed(evt.Id)) return;
Process(evt);
db.MarkProcessed(evt.Id);

In high-throughput, distributed systems, rely on unique constraints on the event ID (or a combination key) in the MarkProcessed step. This guarantees atomicity and prevents race conditions if two consumers attempt to process the event simultaneously.

Durable, Replayable Streams

Use platforms that retain events reliably:

Kafka
AWS EventBridge + SQS
Pulsar
EventStoreDB

Replay is the equivalent of reviewing the order history after service to understand what happened.

Explicit Event Versioning

Events evolve as the business evolves.

Always version your events.

{
  "eventName": "OrderCreated",
  "version": 3,
  "orderId": 12345
}

This is like updating the recipe book—you need to know which version was used.

Event Contract Management (Schema Evolution)

Managing the schema itself is a real operational challenge.

Common solutions

Schema Registry (Confluent, AWS Glue)
Avro / Protobuf with compatibility modes
Automated consumer-driven contract tests

Just as a restaurant must keep recipes and menus consistent across teams, event schemas must stay compatible across services.

Domain-Driven Event Naming

Good events describe meaningful business events—not technical state changes.

✔ OrderPaid

✔ ShipmentDispatched

✔ StockShortageDetected

These read like “kitchen tickets”—instantly meaningful across teams.

Correlation IDs

Attach a correlation ID that follows the event across the system.

It’s your equivalent of an order number in a busy restaurant—the thing that ties together all actions associated with a single request.

x-correlation-id: d387f799e001-4a12-a3f1

Why Correlation IDs are Essential

In a decoupled EDA, the logical flow of a single business request is spread across multiple services, message queues, and logs. Without a correlation ID, this flow is almost impossible to trace.

Distributed Debugging: If a customer reports a failure on order 12345, you can search your centralised logging system (like Splunk or ELK) using the correlation ID and instantly retrieve every log line, from every service, that contributed to that order's fulfillment.
Request Tracing: They are the backbone of Application Performance Monitoring (APM) tools, which visualize the end-to-end path, latency, and dependencies of a request across your entire system.
Cross-System Auditing: They provide the non-repudiable link between an incoming API call and the final persistent action (e.g., database write or shipment creation), fulfilling compliance needs.

A system without correlation IDs is a black box. They are the single most important tool for turning a distributed system into something observable and debuggable.

Conclusion

Event-driven architecture unlocks scalability, resilience, and autonomy across teams. By understanding patterns like event notification, ECST, event sourcing, choreography, and orchestration, you can match your workflow’s needs to the right design.

The light kitchen analogy highlights what makes EDA so powerful: each station works independently, yet the whole system flows smoothly.

Combined with strong practices—idempotency, schema governance, replay, and correlation—these patterns help systems evolve with confidence even under rapid growth.

November 10, 2025

Michael Sidharta

Moving from Django DRF to Ninja API / Pydantic

November 10, 2025

Michael Sidharta

As our project grows, we're always looking for ways to streamline development, improve performance, and enhance the developer experience. Recently, we've been exploring a shift from our traditional Django REST Framework (DRF) API patterns to a combination of Django Ninja API and Pydantic. This blog post will delve into our motivations for this change, the benefits we've observed, and some considerations for others contemplating a similar transition.

Why Consider a Change from Django DRF?

Django REST Framework has been a robust and widely adopted solution for building APIs with Django. It provides a comprehensive set of tools, including serializers, viewsets, and excellent browser-based API interfaces. However, as our needs evolved, we identified areas where a different approach could offer advantages:

Boilerplate Code: While DRF offers powerful abstractions, creating serializers, views, and viewsets can sometimes lead to a significant amount of boilerplate code, especially for simpler APIs.
Performance: For certain use cases, the overhead of DRF's serializer validation and rendering can impact performance, particularly in high-throughput scenarios.
Modern Python Features: We were keen to leverage modern Python features like type hints and data validation more extensively, which are core to Pydantic.
Developer Experience: A more concise and explicit way to define API endpoints and data structures could improve developer productivity and reduce potential errors.

Introducing Django Ninja API and Pydantic

Django Ninja API

Django Ninja is a web framework for building APIs with Django and Python 3.6+ type hints. It's heavily inspired by FastAPI and offers a number of compelling features:

Type Hinting for API Endpoints: You define your request and response models using Pydantic, and Ninja automatically validates and serializes the data based on these type hints.
Automatic OpenAPI (Swagger) Documentation: Just like FastAPI, Ninja generates interactive API documentation out of the box, making it easy to explore and test your API.
Fast Performance: Ninja is designed for speed, with minimal overhead and efficient request/response handling.
Simplified View Definitions: API endpoints are defined as simple Python functions, reducing the complexity often associated with DRF viewsets.

Pydantic

Pydantic is a data validation and settings management library using Python type hints. It's incredibly powerful for:

Data Validation: Automatically validates data against defined schemas, ensuring data integrity and catching errors early.
Serialization/Deserialization: Easily converts Python objects to and from JSON (or other formats) based on the defined models.
Runtime Type Checking: While Python's type hints are typically for static analysis, Pydantic brings runtime type checking to your data.

The Combo: Ninja API and Pydantic in Action

The synergy between Django Ninja API and Pydantic is where the real magic happens. Pydantic models define the structure and validation rules for both incoming request data and outgoing response data. Django Ninja then uses these Pydantic models to automatically:

Validate Incoming Request Body: Any data sent to your API endpoint is automatically validated against the specified Pydantic model. If the data doesn't conform, a clear validation error is returned.
Serialize Outgoing Response Data: When you return data from your API endpoint, Ninja uses the response Pydantic model to serialize it into the appropriate format (e.g., JSON).
Generate OpenAPI Documentation: The Pydantic models directly contribute to the rich and accurate OpenAPI documentation, describing the expected request body and the structure of the responses.

The Core Shift: From Serializers to Schemas

This is the most significant change when moving from DRF to Ninja.

In DRF, a Serializer handles both input validation and output serialization.

  
    # DRF Serializer
from rest_framework import serializers
from .models import Product

class ProductSerializer(serializers.ModelSerializer):
    class Meta:
        model = Product
        fields = ['id', 'name', 'price']

# DRF ViewSet
from rest_framework import viewsets

class ProductViewSet(viewsets.ModelViewSet):
    queryset = Product.objects.all()
    serializer_class = ProductSerializer

  

In Django Ninja, you use Pydantic Schemas for validation and a separate ModelSchema for serializing Django models.

Migrating from Django DRF to Django Ninja: A Developer's Guide

For years, Django REST Framework (DRF) has been the go-to for building robust APIs in Django. It's a powerful, feature-rich library with a massive community and a well-established pattern of Serializers, ViewSets, and Routers. But a new contender has emerged, inspired by the speed and simplicity of FastAPI: Django Ninja. If you're considering a switch, you're not alone. This guide will walk you through the key differences and how to make the move, leveraging the power of Pydantic.

Why Make the Switch?

DRF is a fantastic tool, but it can be verbose. The typical workflow often involves creating a Serializer class for data validation and serialization, a ViewSet for handling CRUD logic, and then a Router to generate the URLs. This can lead to a lot of boilerplate code, even for simple endpoints.

Django Ninja, on the other hand, is built on Pydantic and Python type hints. This modern approach offers several compelling benefits:

Less Boilerplate: You define your API endpoints as simple functions, using type hints for request and response data. Pydantic handles the heavy lifting of validation and serialization, drastically reducing the amount of code you need to write.
Automatic Documentation: Just like FastAPI, Django Ninja automatically generates interactive OpenAPI documentation (Swagger UI and ReDoc) from your type-hinted code. This means no more manual documentation or separate packages.
Intuitive & Explicit: The code is highly readable and explicit. Instead of relying on ViewSet magic, you define each endpoint with a simple decorator (@api.get, @api.post, etc.).

The Core Shift: From Serializers to Schemas

This is the most significant change when moving from DRF to Ninja.

In DRF, a Serializer handles both input validation and output serialization.

  
    from rest_framework import serializers
from .models import Product

# DRF Serializer
class ProductSerializer(serializers.ModelSerializer):
    class Meta:
        model = Product
        fields = ['id', 'name', 'price']


# DRF ViewSet
from rest_framework import viewsets

class ProductViewSet(viewsets.ModelViewSet):
    queryset = Product.objects.all()
    serializer_class = ProductSerializer

  

In Django Ninja, you use Pydantic Schemas for validation and a separate ModelSchema for serializing Django models.

  
    # Django Ninja Schemas
from ninja import Schema, ModelSchema
from .models import Product

# For request payload validation (input)
class ProductIn(Schema):
    name: str
    price: float

# For response data (output)
class ProductOut(ModelSchema):
    class Config:
        model = Product
        model_fields = ['id', 'name', 'price']

  

The ModelSchema is particularly powerful as it automatically creates a Pydantic schema based on your Django model, handling the conversion for you. This means you can often return a Django QuerySet or model instance directly, and Ninja will use the response schema you defined to handle the serialization.

A Practical Example: Migrating a CRUD Endpoint

Let's imagine a simple API for a Product model.

DRF Pattern (the old way):

Model (models.py):

  
    from django.db import models

class Product(models.Model):
    name = models.CharField(max_length=255)
    price = models.DecimalField(max_digits=10, decimal_places=2)

Serializer (serializers.py):

  
    from rest_framework import serializers
from .models import Product

class ProductSerializer(serializers.ModelSerializer):
    class Meta:
        model = Product
        fields = '__all__'
  

ViewSet (views.py):

  
    from rest_framework import viewsets
from .models import Product
from .serializers import ProductSerializer

class ProductViewSet(viewsets.ModelViewSet):
    queryset = Product.objects.all()
    serializer_class = ProductSerializer

  

URLs (urls.py):

  
    from django.urls import path, include
from rest_framework.routers import DefaultRouter
from .views import ProductViewSet

router = DefaultRouter()
router.register('products', ProductViewSet)

urlpatterns = [
    path('', include(router.urls)),
]

  

This setup automatically generates all your CRUD endpoints, which is a key advantage of DRF, but also abstracts away a lot of the implementation.

Django Ninja Pattern (the new way):

Model (models.py): Remains the same.

API Logic (api.py): This is where everything happens.

  
    from ninja import NinjaAPI, ModelSchema, Schema
from typing import List
from .models import Product

api = NinjaAPI()

class ProductIn(Schema):
    name: str
    price: float

class ProductOut(ModelSchema):
    class Config:
        model = Product
        model_fields = ['id', 'name', 'price']

@api.post("/products", response=ProductOut)
def create_product(request, payload: ProductIn):
    product = Product.objects.create(**payload.dict())
    return product

@api.get("/products", response=List[ProductOut])
def list_products(request):
    return Product.objects.all()

@api.get("/products/", response=ProductOut)
def get_product(request, product_id: int):
    return Product.objects.get(id=product_id)

@api.put("/products/", response=ProductOut)
def update_product(request, product_id: int, payload: ProductIn):
    product = Product.objects.get(id=product_id)
    for attr, value in payload.dict(exclude_unset=True).items():
        setattr(product, attr, value)
    product.save()
    return product

@api.delete("/products/")
def delete_product(request, product_id: int):
    product = Product.objects.get(id=product_id)
    product.delete()
    return {"success": True}

  

URLs (urls.py):

  
    from django.urls import path
from .api import api

urlpatterns = [
    path("api/", api.urls),
]

This setup is more manual, but the logic is right there in the function. You can clearly see the input (payload: ProductIn) and the expected output (response=ProductOut), making the code self-documenting.

Benefits We've Experienced

Since adopting this combo, we've observed several significant improvements:

Reduced Boilerplate: We can define a complete API endpoint with input validation, output serialization, and automatic documentation in a much more concise way.
Simple API implementation: When you look at it at a glance, it almost looks like something that is more of a FastAPI implementation. We want to keep things simple, define your endpoints, your expected query parameter and data, and report back the response. At a glance for one endpoint, it looks very clear and nothing is hiding you from correctly guessing what this endpoint is looking for and what it is responding
Less magic / hidden operations: Django DRF is very powerful - however as for every framework, it hides certain things that some developers might not be aware of, often catching them off guard when trying to trace things. Pydantic makes it clearer
Enhanced Type Safety: Leveraging type hints with Pydantic has made our API code more robust and less prone to common data-related errors. It's also made our codebase easier to understand and maintain. It also identifies any type errors on building time, reducing any chances of runtime errors
Better Developer Experience: The automatic OpenAPI documentation and the explicit nature of Pydantic models have made it easier for developers to build and consume our APIs.
Better code encapsulation: As your code gets bigger, often you will find integrating your model everywhere is not going to be maintainable in the future (Think about modifying your model or putting a cache layer above it since it is a big model). With Ninja / Pydantic implementation, your model does not have to integrate with the API, furthermore you can try to implement service based pattern to separate the concern between API implementation and the actual business logic, which again makes it way cleaner

Considerations for Transition

While the benefits are clear for us, a transition isn't without its considerations:

Learning Curve: Developers familiar with DRF's serializer-heavy approach will need to adapt to Pydantic's data modeling.
Existing Codebase: Migrating an existing, large DRF codebase to Ninja/Pydantic requires a thoughtful strategy, potentially taking an iterative approach.
Ecosystem Maturity: While both Django Ninja and Pydantic are mature and widely used, DRF has a larger and more established ecosystem of plugins and community support.

Conclusion

The move from Django DRF API patterns to the Django Ninja API and Pydantic combo has been a positive step for our team. It has allowed us to build more performant, maintainable, and type-safe APIs with a better developer experience. While DRF remains a powerful tool, for our current and future needs, the elegance and efficiency of Ninja and Pydantic are proving to be a winning combination.

We encourage other Django developers to explore this powerful duo, especially if you're looking to modernize your API development workflow and leverage the full potential of Python type hints.

October 6, 2025

Kogan Dev Blog

Culture, Learning & Growth

Four Years Strong: Celebrating Our Koganniversaries

October 6, 2025

Kogan Dev Blog

Culture, Learning & Growth

In a talent landscape full of competitive opportunities, where change and turnover are part of the norm, Kogan.com stands out as a place where people choose to stay, grow, and advance their careers. Many of our team members have long tenures, with some contributing as long as 12 or 15 years, reflecting the strong culture and opportunities here.

This month, we celebrated three team members reaching their four-year milestone. To mark the occasion, we spoke with them about what has made their journey so rewarding, what has kept them at Kogan.com, and what continues to inspire and excite them as part of our team.

Adam Slomoi

Adam is currently the Tech Lead of a squad. He joined Kogan as a Software Engineer and quickly progressed to Senior Software Engineer, and most recently to Tech Lead. In this role, he continues to sharpen his technical skills while growing his passion for people leadership.

How has your role or perspective on engineering evolved over the last four years?I’ve been fortunate to work across different areas of the business and on various components of our system during my time at Kogan. One theme that has remained consistent throughout, and that I’ve gained a greater appreciation for, is the focus on business outcomes.

How did you feel stepping into your first leadership role, and what did you learn from the experience?
I felt well prepared before officially taking on a leadership role. We have a very collaborative team, and there have been many opportunities along the way to have a say and help set the direction for the team.

Sam O’Halloran

Sam started at Kogan in his first formal software engineering role straight out of university. He has honed his technical skills, contributed significantly to his team, and grown in both technical depth and breadth, applying best practices efficiently.

Which project are you most proud of, and what made it exciting or unique?
I’m most proud of the work on product variants. It wasn’t one big launch, just lots of smaller fixes that added up. I tightened grouping logic in pipelines and batch jobs, cleaned up SPS edge cases, and ensured changes flowed through to OpenSearch. On the UI side, I tweaked filters, added a simple horizontal selector for single-variant products, enabled a promo palette, fixed dropdown ordering, and added a VariantGroup sitemap for better SEO. It was satisfying because variants are messy in the real world, and these changes made choosing the right option faster and clearer for customers, and saner for our internal teams.

Can you share a moment where your work made a noticeable impact for the team or the product?
I led a performance pass on our Product Listing Page. I cut work above the fold, switched tiles to responsive images, added light skeletons, virtualised the brand filter with React Virtuoso, lazy-rendered heavier lists, and fixed scroll and CLS issues. The page now loads faster and filtering feels smoother, which reduced the performance issues we were tracking on that screen.

I also flagged follow-ups for our Product List endpoint to make it simpler, faster, and more reliable.

Yanxu Zheng

Yanxu is a highly skilled Senior Software Engineer who has developed deep technical expertise. He was recently promoted to a tech lead role and is now responsible for leading a large squad.

What’s one of the most interesting technical challenges you’ve solved during your time here, and how did you approach it?
Our keyword-based search sometimes failed to grasp user intent, leading to irrelevant results. I tackled this by leading an experiment to test whether a hybrid search model combining keyword and semantic matching could perform better. To avoid any risk to live services, I ran the experiment on a parallel cluster that mirrored production data. We then conducted an A/B test comparing the new hybrid search against the existing keyword search.

The results were clear: while the new model showed a modest improvement in relevance for some queries, it produced no statistically significant lift in user conversion. Based on the data, we decided not to roll out the feature, as the additional infrastructure cost wasn’t justified by the lack of business impact. My key takeaway was that a technical improvement is only valuable if it moves a core business metric. This experience reinforced my approach of always tying engineering efforts to measurable business outcomes.

Is there a problem you tackled that taught you a new skill or changed how you think about engineering?
I once fixed major performance issues in our OpenSearch cluster by challenging the official best practices. The standard advice was to use 10–50GB shards, but our search latency was poor. I ran experiments and proved that for our specific workload, smaller 5–10GB shards were far more efficient. Switching to smaller shards cut our query latency by over 20%. The experience taught me to always validate standard guidelines with real-world data, as optimal solutions are context-dependent.

September 23, 2025

Mark Elsden

Building Your Own AI Agent

September 23, 2025

Mark Elsden

The DEBI (Data Engineering and Business Intelligence) team recently attended the DataEngBytes 2025 conference, where the hot topic for the year was, unsurprisingly, AI agents. My favorite talk, by Geoffrey Huntley, presented a powerful and surprisingly simple idea: It’s not that hard to build an agent; it’s a few hundred lines of mainly boilerplate code running in a loop with LLM tokens. That’s all it is!

Kogan DEBI team at the DataEngBytes conference 2025

The speaker’s main point was that things were developing extremely rapidly in the AI space, but rather than worrying about how AI might take engineering jobs in the near future, we should become AI producers, leveraging agentic AI to automate things, from data pipelines to our own job functions. Understanding this is, in his words, "perhaps some of the best personal development you can do this year."

This idea is both liberating and empowering. It transforms the conversation from one of anxiety about job security to one of excitement about a new, fundamental skill. Let's pull back the hood on how these agents work and understand the simple primitives that allow us to become producers of automation, not just consumers.

The Fundamentals: The Shift from a Tool to a System

Before you write any code, you need to understand the new paradigm. We’re moving beyond just using Generative AI (Gen AI) as a tool and are now using it to build a complete system: an AI Agent.

Generative AI (Gen AI): The Creator: This is the broad category of AI models that are designed to create new content. LLMs are the most common form of this. They are reactive; you give them a prompt, and they generate a response—be it text, code, or an image. Gen AI is the creative engine.

Agentic AI: The Doer: This is a type of AI system that is designed to act with autonomy. You give it a high-level goal, and it uses its "brain" (a Gen AI model) to reason, plan, and execute actions to achieve that goal. This is the proactive part of AI. The speaker referred to these as "digital squirrels" because they are biased toward action and tool-calling, focusing on incrementally achieving a goal rather than spending a long time on a single thought.

The key insight is that the most powerful Agentic AI systems use a highly-agentic Gen AI model (like Claude Sonnet or Kimi K2) as their core decision-maker, and then wire in other, more precise Gen AI models (the "Oracles") as tools for specific tasks, like research or summarization.

The Agent's Heartbeat: The Inferencing Loop

An agent's core function is an elegant, continuous loop. It's the same loop that powers every AI chat application, but with one critical addition: the ability to execute tools.

User Input: The agent takes a prompt from the user.
Inference: It sends the prompt, along with the entire conversation history, to the LLM.
Response Analysis: It receives a response. If the response is a direct answer, it's printed to the user.
Tool Execution: If the response is a "tool use" signal, the agent interrupts the conversation, executes the specified tool (a local function), and then sends the result back to the LLM to continue the conversation.

This simple, self-correcting loop is the engine that drives all agentic behavior.

The Building Blocks: Primitives of a Coding Agent

The power of an agent comes from its tools. A tool is simply a local function with a description, or "billboard," that you register with the LLM. The LLM's training nudges it to call these functions when it believes they are relevant to the user's request. The workshop demonstrates five fundamental tools for building a coding agent.

Read Tool: This tool reads the contents of a file into the context window. It's the first primitive, allowing the agent to analyze existing files.
List Tool: This tool lists files and directories, giving the agent awareness of its environment, much like an engineer running ls to get their bearings.
Bash Tool: This powerful tool allows the agent to execute shell commands, enabling it to run scripts, check processes, or interact with the system. It's the key to making the agent's work actionable.
Edit Tool: This tool allows the agent to modify files by finding and replacing strings or creating new files. When combined with the other tools, it completes the agent's ability to act on the codebase.
Search Tool: This tool uses an underlying command-line tool like ripgrep to search the codebase for specific patterns. It helps the agent quickly find relevant code without having to read every file.

Putting It All Together: The FizzBuzz Example

By combining these primitives, an agent can perform complex, multi-step tasks. In his talk, Geoffrey illustrated this by having an agent solve the classic programming problem of FizzBuzz. This is a classic programming exercise that requires a program to print numbers from 1 to 100, with a few simple exceptions: for multiples of three, print "Fizz" instead of the number; for multiples of five, print "Buzz"; and for numbers that are multiples of both three and five, print "FizzBuzz."

By giving the agent the prompt, "Hey Claude, create fizzbuzz.js that I can run with Nodejs and has fizzbuzz in it and executes it," we are asking it to orchestrate a multi-step process. The agent will use the Edit Tool to create the file and then the Bash Tool to execute the script and verify the output. The speaker then took it a step further, asking the agent to amend the code to only run to a specific number. The agent successfully handled this by using the Read Tool to check the existing code, the Edit Tool to change it, and the Bash Tool again to verify the new output. This ability to continuously loop back on itself to correct and refine its work is the key to a true agentic system.

The Career Implications for Engineers

In the last six months, AI has become "incredibly real," and the ability to build these systems lets us become producers of automation, not just consumers of it. And here's the best part: the skills are completely transferable. The same principles used for a code-editing agent can be applied to automating data pipelines, CI/CD workflows, database management or even parts of our core job functions.

The final message from the talk was super clear: this technology is here, and it’s surprisingly accessible. As engineers, our value in the coming years is going to be defined by our ability to use and produce automation. The most successful Engineers aren't the ones who fear AI, but the ones who embrace it and learn to build these powerful tools. There’s nothing mystical about agents; they're just an elegant loop built on a few core principles. The next step is to start building one yourself.

Further Resources

https://ghuntley.com/agent/ https://ampcode.com/how-to-build-an-agent

July 27, 2025

Guest User

Order Dispatch Systems at Scale

July 27, 2025

Guest User

How to load-balance like a seasoned waiter

Software systems often parallel the real world. Imagine running a busy restaurant, where customers line up to make orders whilst the kitchen prepares the meals. In the software world, your users are the customers, and your backend services are the kitchen. With more people online than ever before, that line might start to grow out the front door. The ability to scale is no longer optional, it is essential.

Know Your Options

Vertical Scaling - Scale Up	Horizontal Scaling - Scale Out
Expanding your restaurant by adding more tables or a larger kitchen. In software terms, this means scaling up your infrastructure. More powerful CPUs, larger memory, increased throughput etc. This is a relative quick fix, but comes with diminishing returns and limits on how big everything can get.	Opening new restaurant locations to serve more customers simultaneously and distribute existing flows. In software terms, this means adding more API servers, more worker nodes or creating many database replicas. This approach is more flexible and scalable than vertical scaling in the long term.

A Steppingstone - Queues

Just like how customers queue for their order, we create a pull-based task queue for our order management system using AWS Simple Queue Service (SQS). Tasks get queued into the SQS, and a consumer service will continuously poll this queue to process the tasks.

This gives a lot of control for the queue consumer to dictate the frequency of polling, which works well in systems that cannot handle high throughput or requires non-concurrency like the SAP ERP (more on that later). SQS also provides built-in dead-letter-queues, retry policies, at least once delivery guarantee and scales automatically.

Vertical scaling involves sizing up the compute power of the consumer (CPU, RAM etc). Horizontal scaling involves spinning up more consumers of the SQS.

However, queues have limitations:

Latency between order arrival and processing.
Inefficient polling process that checks constantly even when the queue is empty.
Limited fan out, message is designed to be consumed by one service making it difficult for different components to react to the same event.
Concurrency issues, need to tune the message visibility timeout to ensure it isn’t picked up by another consumer instance when scaling out.

The Gold Standard - Events

Scalability starts at the architectural level; enter event-driven programming. Instead of queuing up, customers scan a QR code, and their order is sent instantly to the kitchen, the waiter, the pay desk all at once. No delay, no queues.

We recreate this by having an event publisher that sends messages to an event bus. Which notifies the event subscribers: warehouse, emailing service and SAP simultaneously, allowing them to react to the same order independently. Adding the previous vertical and horizontal scaling options mentioned above, creates a powerful system for processing and dispatching orders as the separate components can be scaled independently. Which also lends well to a microservices architecture.

This model mitigates a lot of the previous approach’s deficiencies:

Lower latency, as it is push based not pull based.
Subscribers can be scaled individually.
Event bus is built to handle concurrency.

There are two ways to implement this in AWS: EventBridge and SNS (Simple Notification Service). We choose EventBridge for its ability to handle more complex workflows and native integrations with third party SaaS applications like Zendesk.

Unlike the SNS approach where messages must be published to a specific topic and risks the number of topics growing too large; EventBridge receives from many sources at once. With advanced filtering capabilities, it can inspect the full event payload and route them to the appropriate consumers. Additionally, event archiving and replays are also supported for improved debugging.

Here is the basic implementation:

Publish events to the EventBridge from your application
Define Event Rules that filter events based on its payload
Configure Targets for each rule - they can have multiple targets

Our target will be SQS as it allows our preexisting .NET services to plug into the new event-driven system without major modifications. However, serverless lambda functions are on the table if we can remove the dependency on SAP, more on this later.

While powerful, event-driven architecture is not without its drawbacks:

Debugging and tracing: Events are asynchronous and loosely coupled making it difficult to find cause and effect. Need to set up comprehensive logging and distributed tracing.

Eventual consistency: System components may be temporarily out of sync. Making it harder to understand behaviour. Logic must also be built to handle stale data gracefully.

Event schema evolution: Changing the payload structure can cause breaking impacts for downstream services. Need to document clearly how the payload is consumed and have a versioning strategy.

Despite these challenges, with the right tooling and implementation, event-driven architectures can be made highly observable, testable and resilient.

A Slow Chef in the Kitchen

Sometimes, the bottleneck isn’t your system but it’s external dependencies. In our case, SAP is that slow chef. The Data Interface API is single threaded and does not support batch processing. If you throw too many requests, it will choke no matter how fast other components are. Identification of such bottlenecks are crucial, lest those other scaling efforts are wasted.

Luckily, in our case we can upgrade SAP from the DI API to a modern alternative called the Service Layer. It is designed with scalability in mind:

Uses HTTP and OData protocols
Can parallel process
Automatic load-balancing
Does not require local installation like DI API

These properties make it much easier to develop web and mobile applications which are more accessible than the SAP Windows client. The service layer’s more stateless nature lends nicely to the aforementioned event-driven architecture, bringing SAP in line with the rest of our scalable system.

Conclusion

Just like a restaurant, software systems should be designed with future scalability in mind. Start with simple abstractions like task queues and evolve to fully decoupled event-driven systems. Horizontal scaling is often better and more flexible than vertical scaling. When faced with external bottlenecks, tackle them head-on. Architecture is the business, with intentional design, your restaurant won’t just keep up but thrive.

References

June 1, 2025

Karen Fehmer

Empowering Data Through Self-Service: Behind the Scenes of Our Data Platform

June 1, 2025

Karen Fehmer

At Kogan.com, our data needs have grown alongside the business. As more teams relied on insights to move quickly, it became clear our request-based BI model couldn’t scale. We needed a platform that empowered teams to answer their own questions, trust the numbers, and move independently. That journey led us to build a self-service platform grounded in governance, transparency, and scalability—powered by dbt, Looker, and Acryl (DataHub).

Rethinking Our BI Model

We originally relied on Tableau. It served us well but had limitations: duplicated logic, inconsistent metrics, and limited collaboration with dbt. Tableau workbooks weren’t version-controlled, which made maintaining consistency difficult. To bridge modeling and reporting, we often created extra presentation tables in dbt, adding complexity. We needed a platform that integrated tightly with dbt and supported governed exploration.

A New Architecture: Modular, Transparent, Scalable

We redesigned the platform around a clean, modular flow: Raw Sources → BigQuery → dbt → Looker → Acryl (DataHub) Our data transformations are built in dbt, where we follow a layered modeling structure. While we use stg_ (staging) and int_ (intermediate) models primarily for data cleaning and standardization, the marts_ models are the ones that power our analysis and reporting. These models contain our fact and dimension tables, fully aligned with business logic and ready for consumption in Looker. We’ve integrated CI/CD pipelines using GitHub Actions, and every change is tested before deployment. This includes dbt tests, schema validations, and model documentation to ensure confidence at every layer.

Why Looker Was the Right Fit for Self-Service

Looker offered a structured, governed approach that aligned with our dbt-first architecture. LookML let us centralize business logic, version it with Git, and deploy changes through CI/CD. With support for multiple environments (UAT and Production), we can test safely before releasing to users. The Explore interface gives business users guided access to curated datasets—no SQL required. Users can drill down, apply filters, and explore confidently. This was a big shift from the Tableau model, which often required analyst support. Looker also includes row-level security, role-based access, and an AI assistant that supports natural language queries and chart generation—lowering the barrier for non-technical users. We’ve also developed internal dashboard standards—consistent layouts, filters, and naming conventions—to ensure usability and reduce support needs.

Bridging the Migration

We’re currently in the process of migrating our Tableau reports into Looker. While Looker is more efficient to build with, the migration isn’t just about re-creating dashboards—we’re using it as an opportunity to improve them. For each report we migrate, we review and sometimes refactor the associated dbt models to ensure the logic is clean, reusable, and well-documented. We also take time to redesign visual layouts to be more intuitive and self-service friendly—adding better filters, descriptive labels, and drill-down paths wherever we can. It’s not just a tech migration—it’s a platform and user experience upgrade.

Data Discovery and Observability with Acryl

Alongside dbt and Looker, Acryl (DataHub) has become the foundation of our metadata ecosystem. Acryl helps both technical and non-technical users understand what data exists, where it comes from, and who owns it. It provides searchable documentation, field descriptions, ownership metadata, and lineage tracing across dbt, BigQuery, and Looker. We also rely on Acryl’s observability features for monitoring anomalies and surfacing potential data quality issues. While it's not a test framework like dbt or a freshness tracker, Acryl helps us detect behavioral anomalies, unexpected changes, and broken relationships before they impact end users. Acryl's AI-powered documentation suggestions have also saved us time when onboarding new models or enhancing existing ones, especially for adding descriptions and tags at scale.

Lessons We’ve Learned

If there’s one takeaway, it’s that data tools need infrastructure—both technical and human. You can’t just launch Looker and expect adoption. You need a warehouse that reflects the business, models that users can trust, documentation that’s visible, and governance that feels supportive, not restrictive. We also learned that governance isn’t about locking things down—it’s about making things clear. When users understand what data means, how it’s calculated, and who owns it, they feel empowered, not limited.

Final Thoughts

We didn’t build a self-service platform just to save time—we built it to build trust. By aligning tools like dbt, Looker, and Acryl into a unified ecosystem, we’ve created something bigger than a data stack. We’ve created a culture where teams are empowered to explore, ask better questions, and make faster decisions—without sacrificing governance or quality. This transformation didn’t happen in a vacuum. It was made possible by the incredible efforts of my team—engineering, analytics, and enablement working hand-in-hand. The commitment to transparency, maintainability, and user empowerment is what brought this platform to life. We’re still learning. But we’re proud of how far we’ve come—and even more excited about where we’re going.

May 9, 2025

Andrew Kerton

Beyond the Code: Threat Modeling as Your Security Superpower

May 9, 2025

Andrew Kerton

As developers, we pour our energy into building robust, elegant software. We craft features, optimise performance, and squash bugs. But in today's world, building secure software is just as crucial. Enter Threat Modeling – not as a bureaucratic chore, but as a practical superpower for developers aiming to build resilient applications.

Think of threat modeling as structured foresight: anticipating how things could go wrong from a security perspective before they happen. It’s about stepping into an attacker's mindset to find weaknesses in your own designs. This proactive approach helps weave defenses right into your application's fabric from the start.

Understanding the Battlefield: Core Security Lingo

To talk effectively about security, we need shared terms. Start with your Assets – the valuable parts of your system, like user data or API keys. These face potential Threats, specific actions that could cause harm, often initiated by Threat Agents like hackers or malware.

To guard against threats, we use Controls (or Countermeasures). These are your front-line defenses designed to prevent attacks or detect them early, like authentication checks or input validation.

But what if a threat gets through? That's where Mitigations come in. These are measures aimed at reducing the damage if a control fails and an attack succeeds. For example, while access controls prevent database intrusion (a Control), encrypting the data reduces the impact if someone does get in (a Mitigation).

Finally, be aware of Trust Boundaries – the lines separating parts of your system with different security levels (like frontend vs. backend). Interactions across these boundaries need special attention.

This common language helps us pinpoint and discuss security risks clearly.

Why Add Threat Modeling to Your Toolkit?

"Another process?" you might ask. But integrating threat modeling saves time and headaches later. Catching a security flaw during design is far cheaper and easier than patching a live system under duress. It helps you avoid building inherently vulnerable features or adding complexity that inadvertently creates attack surfaces. Plus, it fosters a shared understanding of architecture and risks across the team. While ideal early on, threat modeling adds value at any stage.

Walking Through the Process: A Practical Framework

A helpful way to approach threat modeling is by systematically answering a few key questions, often broken into phases:

Phase 1: Mapping Your Territory (What are we working on?)

First, define your scope clearly. What system or feature are you analysing now? What's explicitly out of scope? Identify the critical Assets within these boundaries needing protection. Then, break the system down into its core components – external entities, internal processes, data flows, data stores (databases, caches), and map out the Trust Boundaries.

Visualising this is key. Creating a Data Flow Diagram (DFD) using tools like Draw.io or Miro provides an invaluable map. This diagram isn't just for the threat modeling session; it becomes useful system documentation and helps everyone visualise potential weak points.

Phase 2: Anticipating Attacks (What could go wrong?)

With your map, brainstorm potential threats. Put on your "attacker hat." How could someone misuse this system? The STRIDE framework is excellent for this:

Spoofing: Faking identity (user, server).
Tampering: Unauthorised modification of data or code.
Repudiation: Denying an action performed (aided by poor logging).
Information Disclosure: Leaking sensitive data.
Denial of Service (DoS): Making the system unavailable.
Elevation of Privilege: Gaining unauthorised higher-level access.

Examine your DFD, focusing on data crossing trust boundaries, user inputs, and external interactions. Ask, "How could STRIDE apply here?" Go through each category for relevant parts of your system to uncover potential vulnerabilities.

Phase 3: Building Your Defenses (What are we going to do about it?)

Identifying threats is only half the job. Now, decide how to respond. For each significant threat, consider your options. Can you implement Controls (like strong auth, input sanitisation, encryption) to prevent or detect it? Perhaps you can Eliminate the threat by redesigning a workflow or removing a component. Sometimes, you might Transfer the risk (e.g., using a third-party payment processor). For low-probability, low-impact threats, you might consciously Accept the risk, documenting the decision. Prioritise countermeasures for the highest-risk threats first.

Phase 4: Staying Vigilant (Did we do a good enough job?)

Threat modeling isn't a one-off task. Systems evolve, new features appear, and attackers adapt. Therefore, it's crucial to Review and Refine your threat models periodically, especially after significant changes. Keep diagrams and threat lists current. If you handle sensitive or regulated data (GDPR, PCI), ensure your model addresses Compliance needs. The goal is to Integrate threat modeling thinking into your regular development rhythm.

Level Up Your Security Posture

Threat modeling empowers developers to build security into applications proactively. By considering potential attacks and designing defenses early, you create more resilient, trustworthy software, saving significant time and resources compared to reacting to incidents. Start small, pick a critical feature, map it out, think through STRIDE, and make security an integral part of your development craft.

Have fun!

*This article was written with AI assistance :)

Kogan Engineering Blog

Why Chromatic?

Storybook as the Foundation

Automatic API Mocking with OpenAPI

How Chromatic Fits into the Workflow

Lessons Learned So Far

Final Thoughts

Designing Robust, Scalable, Maintainable Event Architectures

Core Patterns in Event-Driven Architecture

Pattern 1: Event Notification

Pattern 2: Event-Carried State Transfer (ECST)

Pattern 3: Event Sourcing

Pattern 4: Choreography (Decentralised Workflow)

Pattern 5: Orchestration (Service Composer / Workflow Engine)

Best Practices for Event-Driven Systems

Idempotency Everywhere

Durable, Replayable Streams

Explicit Event Versioning

Event Contract Management (Schema Evolution)

Domain-Driven Event Naming

Correlation IDs

Conclusion

Why Consider a Change from Django DRF?

Introducing Django Ninja API and Pydantic

Django Ninja API

Pydantic

The Combo: Ninja API and Pydantic in Action

The Core Shift: From Serializers to Schemas

Why Make the Switch?

The Core Shift: From Serializers to Schemas

A Practical Example: Migrating a CRUD Endpoint

Benefits We've Experienced

Considerations for Transition

Conclusion

The Fundamentals: The Shift from a Tool to a System

The Agent's Heartbeat: The Inferencing Loop

The Building Blocks: Primitives of a Coding Agent

Putting It All Together: The FizzBuzz Example

The Career Implications for Engineers

How to load-balance like a seasoned waiter

Know Your Options

A Steppingstone - Queues

The Gold Standard - Events

A Slow Chef in the Kitchen

Conclusion

References

Rethinking Our BI Model

A New Architecture: Modular, Transparent, Scalable

Why Looker Was the Right Fit for Self-Service

Bridging the Migration

Data Discovery and Observability with Acryl

Lessons We’ve Learned

Final Thoughts

Understanding the Battlefield: Core Security Lingo

Why Add Threat Modeling to Your Toolkit?

Walking Through the Process: A Practical Framework

Level Up Your Security Posture