Stories by IcePanel on Medium

Adopting ADRs in Enterprise

IcePanel — Fri, 08 May 2026 15:43:30 GMT

For anyone who has spent the past few years repeatedly giving LLMs the context behind something before it can give you a useful answer, then I’m sure you’ve come to understand just how easy it is for information to become siloed inside even a small organisation and maybe known only by you.

In enterprise organisations, this unintentional information gatekeeping not only means team members have to spend time running around searching for the right person to understand why an architectural decision was made three years ago. A lot of the time, the reasoning behind those decisions has eroded away or been forgotten completely.

This isn’t good, and it’s certainly not useful if you intend to dust off your auth flow and consider something new, but need to check in with those who built the original version. Those previous decisions may be propping up other architectural decisions. It’s like a stack of fragile playing cards, ready to tumble at the slightest gust of wind.

There’s a whole world of ways in which teams may choose to share why important decisions were made. Maybe in long-form text in a word editor, maybe in a markdown file tied to a repo, a Slack thread, an email or perhaps a mixture of all of them (which is unfortunately most common). You get the idea, these decisions and most importantly all of that context is scattered, and often dependent on the right person still being around.

That’s one reason I’m so sold on ADRs (Architectural Decision Records). Designed to serve as lightweight records of the context behind architecture decisions, ADRs are a log of important decisions made by a team, and the context that surrounds them.

What goes inside an ADR?

Michael Nygard, the man who popularised the idea of ADRs, suggests they should consist of five things: a title (a logical name for the decision), a status (was this idea adopted, rejected or superseded), the context, the decision itself, and the consequences.

That’s a great default and many teams also benefit from expanding on it slightly. One of the most valuable additions is an alternatives considered section, a record of the options you evaluated but didn’t choose, and why. This is where the real value of an ADR lives.

Without this, future engineers have no way of knowing whether an alternative was already debated and ruled out, or simply never considered. The difference matters enormously when someone shows up three years later, suggesting you switch from REST to GraphQL and the team has no memory of the two-week debate that already happened… 3 times!

You might also consider including assumptions and constraints that were true at the time of the decision. Architectural decisions are often only correct under a specific set of circumstances. Capturing those circumstances makes it far easier to know when a decision should be revisited.

Example ADR inside IcePanel

Keep them concise

ADRs shouldn’t become short novels. The goal is a document that’s easily digestible, something an engineer or architect can read in five minutes and walk away with a clear understanding of what was decided and why.

It’s far too easy to create documentation that’s “over-engineered” (please pardon the pun).

The reality is that fewer people will read longer documents, and if the important details are buried three pages in, people skim and miss things. The goal should be to keep the main signal as clean as possible. You can always link out to supporting documentation if a topic needs expansion, but resist the urge to make the ADR itself overly explained.

Who writes them, and when

ADRs shouldn’t be the exclusive domain of architects or tech leads. Any engineer driving a significant decision should be authoring one. If people assume it’s someone else’s job, they simply never get written. You may choose to have them only approvable by a tech lead or architect, but they can be approached collaboratively.

The right moment to write an ADR is before or during the decision, not after. After-the-fact ADRs have a tendency to become rationalisations once the outcome is already known. If they are written in the moment, they capture the genuine uncertainty, the tradeoffs actually considered, and the constraints that were real at the time. That’s what makes them useful later.

ADRs have a lifecycle

One of the most important and non-obvious things about ADRs is what happens when a decision gets revisited. You don’t delete the old record. You mark it as superseded e.g. “superseded by ADR-042” and write a new one explaining what changed and why. This creates a useful audit trail you can follow through time. This kind of internal org memory is rare and has outsized value.

How granular should ADRs be?

As a rule of thumb, any decision that required significant debate, or carries a large number of upstream or downstream dependencies, is worth logging. Choosing PostgreSQL over MongoDB for a new core service is an ADR. Deciding on a caching strategy that affects multiple teams is an ADR. Renaming a variable, or choosing a library for a one-off script, is not.

If you find your team debating whether something warrants an ADR, it probably does.

Rolling this out

If you’re interested in adopting ADRs across your organisation, attempting to do this across all teams at once is possible, but not advisable. Start with one team. Ideally, the tech lead is already bought into the idea and understands the value of having important decisions be searchable and clearly logged. Let that team build the habit, refine the template to suit your organisation, and then introduce the concept to other teams, with evidence of how this has been useful for the original team. Start small, experiment, customize, then scale.

Why is now a good time to implement this sort of structure?

LLMs rely on context to give you useful answers and output. The greater the context, the greater the answer. With the IcePanel MCP, your AI tooling can now read directly from IcePanel. Super handy if you’re about to set an agent on a task to rebuild a part of your service and need to surface constraints. It’s becoming clear that teams who invest in internal memory now are going to get significantly more out of AI tooling than those who don’t. ADRs are a small habit with compounding returns, and it couldn’t be a better time to start building that habit!

Get started and write your first ADR ✍️

Impact of AI on Migration Projects

IcePanel — Mon, 27 Apr 2026 16:09:22 GMT

Introduction

Software architects play a key role in maintaining architectures through migration projects. Reasons might be driven by security requirements (e.g., integrating with a new 3rd party system or deprecating legacy software) or complying with new product requirements (e.g., scaling to meet customer demand or supporting new features from developer teams).

AI has a visible impact on software development, from coding to testing to documenting features. Now with the rise of AI agents, there are opportunities to integrate AI workflows for architecture-focused projects.

In this post we’ll talk about the impact of using AI tools like Claude Code or Codex for architecture projects. We’ll cover what a migration project looks like, the pros and cons of using AI, and tips for managing such projects.

Case study: Database migration project

I once had a 6-month project to migrate our primary database (Aurora PostgreSQL) with multiple microservices connected directly. There was a product discussion to scale our system, and as a result of horizontally scaling the microservices, we started hitting connection limits to Aurora. The services were exhausting the database’s max connections during peak traffic and causing intermittent failures and slow queries.

Wearing the architect’s hat, we started to scope the problem. The project seemed straightforward on paper, but there were several dependencies that required a careful rollout strategy. For example:

How many services were connected to our primary database?
What were the read/write patterns? Some services were read-heavy and could be routed to read replicas. Others needed write access to the primary.
What other migrations or projects are in flight? We had to coordinate timelines with two other teams shipping features that depended on the same database.
What would the post-migration deployment process look like? Every team deploying a new service needs to know the new connection path.

Our solution was to introduce RDS Proxy as an intermediary layer between the database and services. This allows services to point to the proxy endpoint instead of the database directly. The proxy handles connection pooling, failover, and routing to read replicas.

To manage a safe migration, we had to roll this out into multiple stages:

Stage A: Deploy RDS proxy
Stage B: Migrate services one by one
Stage C: Scale out the architecture by adding more read replicas

The proxy deployment itself was straightforward. What took time was the coordination: testing each service’s behaviour through the proxy under load, validating failover scenarios, getting sign-off through the organisation’s process, and sequencing the migration across teams without disrupting production traffic.

AI impact

What worked well was collecting data points with agentic tools like Claude Code or Cascade and building a prototype to test the RDS proxy connection. For example, we used AI to audit the 12 services connected to the primary database and map their query patterns. We used AI to prototype most of the Terraform configuration for the proxy. It also wrote the migration playbook for other teams. Each service migration required updating infrastructure environment variables, running regression tests, and redeploying.

How can AI help

There are three main things AI can massively help:

Automating repeatable tasks
Prototyping and researching solutions
Challenging your proposals

1. Automate repeatable tasks

Architects or engineers will be asked to follow some process when driving this migration project. Depending on your organisation, you might need to write story-pointed tasks, share a technical spec for review, or give regular updates. Some of these tactics are repeatable steps that architects have to allocate time for to deliver a project successfully. However, they can be time expensive for an engineering resource, especially at scale. That’s where AI automation can help.

AI excels at these repeatable artifacts. You can prompt it to generate Jira tickets from a migration checklist, draft status updates from commit logs, or convert meeting notes into action items. The key is investing in building templates that align with your organisation’s format so it can produce consistent outputs with minimal editing.

2. Prototyping and researching ideas

Architects need to prototype solutions and evaluate alternatives before committing. AI can help accelerate prototype development and generate comparisons with the main architect as the decision maker.

For example, in our RDS proxy migration, we used AI to compare connection pooling alternatives (RDS Proxy vs application-level pooling) and draft Terraform modules for services. This let us evaluate trade-offs in order of hours without investing too much development time.

3. Challenging your proposals

Before I publish a migration plan or share a technical spec with stakeholders, I like to prompt AI to critique my ideas. I’ve found AI surprisingly good at surfacing edge cases and suggesting changes. If I’m brainstorming a new solution, I often try to challenge my thinking and see if there’s a better way to solve a problem. With AI, I’m able to broaden the solution space and identify the optimal solution to adopt.

Where AI falls short

1. Verbosity (AI-slop)

AI is good at generating content, not context. Whenever you’re prompting an LLM to explain or document, it will often lean towards generating more characters to make the overall output “look” good, but with marginal value.

2. Over-simplification

AI is good at solving problems based on the parameters it is aware of. As you work on a migration project, it is a good practice not to accept AI-generated solutions at face value, because there is a good chance it is missing context and oversimplifying the problem you’re trying to solve. For example, if you have two repositories: (1) Application repo and (2) Infrastructure repo, and you gave access to (2) to help come up with a migration plan. It is almost certain that without scanning other sources like (1), it will misjudge the complexity of the problem and give answers without seeing the full picture.

My advice is to try to give access to multiple tools/MCPs, but be critical of the output and identify missing context.

3. Delegating critical decisions to AI

Jeff Bezos once introduced the idea of “Decision Types” that later became a core part of Amazon’s culture: Type 1 and Type 2 decisions. He wrote the following in a shareholder letter:

“Some decisions are consequential and irreversible or nearly irreversible — one-way doors — and these decisions must be made methodically, carefully, slowly, with great deliberation and consultation. If you walk through and don’t like what you see on the other side, you can’t get back to where you were before. We can call these Type 1 decisions. But most decisions aren’t like that — they are changeable, reversible — they’re two-way doors. If you’ve made a suboptimal Type 2 decision, you don’t have to live with the consequences for that long. You can reopen the door and go back through. Type 2 decisions can and should be made quickly by high judgment individuals or small groups.”

Using AI for Type 2 decisions is a great use case. It frees up decision-making space from engineers and allows them to focus on the right problems. However, using AI for Type 1 decisions can be very expensive and risky. Primarily, because you are accountable for the end product of your decisions, and delegating it to AI can damage your reputation and affect the overall project.

Good advice is to explore AI tools and automate most (if not all) Type 2 decisions, and allocate deep thinking for Type 1 decisions.

Conclusion

Architectural thinking has and continues to be a core skill required by businesses from their architects. AI cannot replace that skill, however it can add great leverage for software architects like automating repeatable tasks, prototyping, and research. On the flip side, AI falls short on things like context-heavy decisions and accountability.

If you’re working on a migration project, try to use AI to accelerate the mechanics, but own the strategy. Software architects and engineers are needed for human judgment on many activities like project scoping, coordination, stakeholder alignment, and planning architectural changes.

📚 Resources

Design a metrics & alerting system using IcePanel

IcePanel — Tue, 03 Mar 2026 22:19:43 GMT

📝 Introduction

In this post, we’ll design a Metrics Monitoring and Alerting System that engineers use for their infrastructure observability and maintenance health. The industry standard for this system is a combination of two technologies: Prometheus (metrics collection, storage, and alerting) and Grafana (visualisation). We’ll break down the problem, establish scope, and design the system as software architects. We’ll also create four hierarchical diagrams: Context, Container, Component, and Code (not familiar with these? Read this).

Each diagram will be annotated to explain the key building blocks and responsibilities, and we’ll highlight how users interact with the system by visualising data flow using IcePanel Flows.

We’ll start by defining the scope of the system through its functional requirements (what the system does) and non-functional requirements (the qualities it should have). From there, we’ll progressively go through each layer of the C4 model to build the overall architecture.

You can view the final architecture at this link: https://s.icepanel.io/DWnaysJ3cbCQqg/IlDZ

🔎 Scope

A well-designed monitoring and alerting system plays a key role in providing visibility into the health of a system’s infrastructure. This is crucial for engineers who want to ensure high availability and reliability of their systems. This will be factored into our non-functional requirements.

First, we define what we mean by metrics in this post:

Infrastructure metrics: Operational system metrics that contain low-level usage data such as CPU load, memory usage, and disk space.
Software metrics: Service-level metrics containing higher-level data like requests per second, request latency, or number of running servers in a cluster.
Business metrics: This is outside the scope of this post. We’ll focus on infrastructure and software metrics.

The functional requirements for this system are:

Engineers should be able to query data sources (Prometheus) and view real-time dashboards and metrics (Grafana).
The system should evaluate alert rules (Prometheus) against metrics and page on-call engineers when rules have been breached.

The non-functional requirements for this system are:

Highly scalable. Our metrics collection pipeline should accommodate growing metrics and alert volume. Let’s quantify it as 100 server pools and 100 machines per pool, scraping ~100 metrics per machine every 10 seconds. Roughly 100,000 data points / second or ~8.6 billion data points per day! This estimate describes a write-heavy system.
Low latency. Viewing dashboards should reflect near real-time metrics. Engineers should not miss any fired alerts.
Availability >> consistency. Losing a few data points is tolerable. However, the system must be available 24/7, especially for alerts.

We’ll intentionally leave log monitoring (ELK stack) and distributed tracing outside of scope.

Let’s start with the first diagram, Context.

Level 1 — Context

The Context layer defines the actors and external systems we depend on. In this design, we have one actor and three external systems:

Engineer (Actor): Monitors system health by querying metrics and viewing dashboards. This includes software engineers, on-call responders, platform, and SREs. They also get notified by alerts from the system in case a predefined threshold was breached.
AWS CloudWatch (System): An AWS monitoring service that collects metrics from cloud infrastructure and managed services. Our system pulls infrastructure metrics from CloudWatch to surface them alongside application metrics. If you’re interested to learn more, this is implemented by a CloudWatch exporter (read here).
Monitored Services (System): Self-instrumented applications, databases, and message queues that expose a /metrics HTTP endpoint for our system to pull. Our system scrapes them directly using Prometheus’s pull model.
PagerDuty (System): An incident management platform that receives triggered alerts from our system. It routes them to the on-call engineer. Other notification channels (Email, Slack, SMS) are also supported but abstracted away for simplicity.

Prometheus and Grafana are not visible yet in this Context view. They live inside the system boundary, which we’ll cover next in the Container diagram.

Level 2 — Container

The Container layer models independently deployable applications, services, and data stores. We use a microservices architecture to independently scale certain workloads and provide high availability and performance as defined in the non-functional requirements. This is a write-heavy system (100k data points / second) with spiky reads (metrics queries or dashboards refreshing every X seconds). This write-heavy pattern directly shapes the architecture decisions in the Container layer below.

The system is composed of three main parts:

1. Metrics collection:

Metrics Collector Pool (ECS Autoscaling): Pool of metrics collectors that pull /metrics endpoints from monitored services on a configured interval (e.g., 10 seconds). These collectors are distributed across targets using a consistent hash ring to avoid duplicate data collection.
Service Discovery (etcd / Zookeeper): A coordination component that maintains the dynamic registry of scrape targets. It notifies the Metrics Collector Pool when monitored services are added or removed so collectors always know what to scrape.
Message Queue (Kafka): Receives raw metrics from the Collector Pool and buffers them before processing. Decouples collection from storage and prevents data loss when the TSDB is unavailable. Partitioned by metric name and label for parallel consumption.
Stream Processor (Flink): Consumes raw metrics from Kafka, aggregates and processes them, then writes to the TSDB. Handles late-arriving data and reduces write volume through pre-aggregation.
Time-Series Database (InfluxDB): A time-series database (TSDB) that stores all metrics as time-series data with label-based indexing. It provides optimisation features for time-series data like downsampling and data compression.

2. Alerting:

Prometheus Server (ECS): Evaluates alert rules defined in YAML config files against the TSDB on a configured interval. Pushes fired alerts to Alertmanager via HTTP POST.
Alert Manager (ECS): Receives fired alerts from Prometheus Server. Deduplicates, groups, and routes them to Alert Consumers via Kafka.
Alert Store (DynamoDB): Persists alert state (inactive → pending → firing → resolved) to ensure at-least-once delivery across restarts and retries.
Alert Consumer (ECS): Pulls alert events from Kafka and dispatches notifications to PagerDuty, Slack, Email, or HTTP endpoints.

3. Visualisation and querying:

Alerts UI (React): Frontend for engineers to view firing alerts and manage silences via the Alertmanager API.
Metrics Dashboard (React / Grafana Frontend): A UI for engineers to query and visualise time-series metrics via the Query Service.
Query Service (ECS / Grafana Backend): Receives query requests (via PromQL) from the Metrics Dashboard, checks Redis for cached results, and falls back to the TSDB on a cache miss.
Query Cache (Redis): Caches frequently accessed query results to reduce repeated read load on the TSDB.

How the system works:

The system operates across two parallel pipelines (metrics collection pipeline and alerting pipeline), which share the TSDB as the ground-truth datastore.

Metrics pipeline: The Metrics Collector Pool scrapes these /metrics endpoints with Service Discovery (zookeeper/etcd) managing the target list dynamically. Scraped metrics are written to Kafka, consumed by the Stream Processor (Flink), and written to the TSDB. When engineers query dashboards, the Metrics Dashboard sends PromQL requests to the Query Service, which checks the Redis cache first, and falls back to the TSDB on a cache miss.

Alerting pipeline: The Prometheus Server evaluates alert rules from YAML config files against the TSDB every evaluation interval (e.g., 10 seconds). When a threshold is breached, Prometheus Server pushes the alert to Alert Manager via HTTP POST. Alert Manager deduplicates alerts, routes them to the dispatcher, dispatches to consumers (via Kafka), and persists its state to DynamoDB. Alert Consumers pull from Kafka and dispatch notifications to PagerDuty, Email, Slack, or HTTP endpoints. PagerDuty then pages the on-call engineer.

Data flows

We’ve designed two primary data flows using IcePanel. Check them out and see how the system works.

Flow 1: Engineer queries a metrics dashboard

Complete flow: https://s.icepanel.io/DWnaysJ3cbCQqg/Jvuo

Engineer opens the Metrics Dashboard (Grafana UI)
React app sends HTTP GET with a PromQL query to the Query Service
Query Service checks Redis. Cache hit returns immediately; cache miss executes against TSDB
Time-series data is returned and rendered as charts

Flow 2: Alert fires and pages the on-call engineer

Complete flow: https://s.icepanel.io/DWnaysJ3cbCQqg/342p

Prometheus Server evaluates an alert rule against the TSDB
Prometheus Server pushes alert to Alertmanager via HTTP POST
Alert event published to Kafka
Alert Consumer dispatches notification to PagerDuty
PagerDuty pages the on-call engineer

Before going into level 3 (Component), there are a few key architectural decisions worth discussing. We’ll go through the data model, storage, and push vs pull models.

Data Model

Metrics data is recorded as time-series data: a set of values with associated timestamps, uniquely identified by a metric name and optional tags. For example:

An engineer might want to query the system to answer questions like:

What is the CPU load on the production server at 18:00?
What is the average memory load across all Redis caches for the last 6 hours?
What is the total number of requests to our API in the last hour?

Here’s an example Grafana dashboard showing real-time website performance metrics like memory and CPU usage, server requests, login activity, and client-side page load times broken down by percentile (source).

Selecting a database storage

In theory, a general-purpose database could support time-series data, but it would require expert-level tuning to make it work at our scale. For example, a SQL database like PostgreSQL is not optimised for operations you would commonly perform against time-series data. NoSQL is another choice. Cassandra can be used for time-series data. However, this would still require expert NoSQL knowledge to design a scalable schema for storing and querying time-series data.

For this problem, we’ll need a storage system optimized for time-series data. AWS has Timestream as a managed time-series database. According to DB-engines, the two most popular ones are InfluxDB and TimescaleDB, which are designed to store large volumes of time-series data and quickly perform real-time analysis on that data. Both of them primarily rely on an in-memory cache and on-disk storage. They both handle durability and performance quite well. For this example, we’ll pick InfluxDB. There was a previous benchmark that reported InfluxDB with 8 cores and 32GB RAM can handle over 250k writes per second, which fits well within our requirements.

Pull vs Push model

There are two ways metrics data can be collected. In a pull model, dedicated collectors scrape /metrics endpoints from running applications on a configured interval. In a push model, a collection agent is installed on every monitored server as a sidecar container. Examples of push architecture are Amazon CloudWatch and Graphite.

Prometheus uses a pull model. The collector scrapes targets rather than targets pushing to the collector. This is deliberate: it makes the collector the source of truth for what’s being monitored, enables easier debugging (you can curl any /metrics endpoint directly), and allows Service Discovery to manage the target list centrally.

For our design, we use the pull model as the primary path. CloudWatch metrics enter via the CloudWatch Exporter, which translates the CloudWatch API format into the standard /metrics format that Prometheus can scrape.

Level 3 — Component

In the C4 model, a component is a grouping of related functionality encapsulated behind a well-defined interface. We’ll zoom into three components: Prometheus Server, Alert Manager, and Query Service.

1. Prometheus Server

Prometheus Server queries the TSDB directly via PromQL to evaluate alert rules and detect threshold violations. It primarily consists of:

Rule Evaluator: A module that reads alert rules from YAML config files (stored in S3 and downloaded at startup). On every evaluation interval, it executes PromQL queries against the TSDB and compares results against defined thresholds. If a breach persists for the configured duration, it marks the alert as firing and passes it to the Notifier.

Notifier: A module that receives fired alerts from the Rule Evaluator. It pushes fired alerts to the Alert Manager via HTTP POST /api/v1/alerts, where they are deduplicated, dispatched, and persisted.

2. Alert Manager

The Alert Manager receives fired alerts from Prometheus Server and handles the full routing, deduplication, and dispatch lifecycle. It consists of four main modules:

HTTP Handler: Receives the fired alerts via HTTP POST from Prometheus Server and provides GET endpoints to serve the Alerts UI for engineers to view firing alerts and manage silences.

Deduplicator: Suppresses duplicate alerts from the handler. If the same alert fires multiple times within a grouping window, it is collapsed into a single notification to prevent spamming the notification channels.

Alert Dispatcher: Assembles the final alert event and routes it downstream. It publishes the alert event to Kafka for consumption by Alert Consumers.

DynamoDB Client: Persists the alert state transitions (inactive → pending → firing → resolved) to the Alert Store (DynamoDB). It ensures the system can resume correctly across restarts and guarantees at-least-once delivery across retries.

3. Query Service

Query Service is Grafana’s core backend. It receives PromQL query requests from the Metrics Dashboard and executes them against the TSDB. It also caches short-lived query results to Redis to serve repeatable queries.

Query Handler: Receives HTTP GET requests from the Metrics Dashboard containing PromQL expressions. It validates the query, checks Redis for a cached result first, and falls back to executing directly against the TSDB on a cache miss.

The cache is particularly valuable for dashboard panels that auto-refresh every few seconds. Without caching, every render will hit the TSDB directly and create a spiky read load.

Let’s go one level deeper with the source code in the Code layer.

Level 4 — Code

At the code level, we focus on implementation structure rather than deployment. Rather than showing full implementations, we model the public API of each component and the contracts between them. This maps directly to the components described at Level 3. We’ll briefly describe two components: Prometheus Server and Alert Manager.

1. Prometheus Server (Rule Evaluator)

The Rule Evaluator is the core evaluation loop inside Prometheus Server. It loads alert rule groups from YAML config files, each containing one or more rules with a PromQL expression, a threshold, and a duration (how long a breach must persist before the alert fires). On every evaluation tick it queries the TSDB and compares results. If a rule is breached, it hands the alert to the Notifier.

class AlertRule
name: e.g. “HighCPUUsage”
expr: PromQL expression, e.g. “cpu_usage > 0.9”
for_duration: seconds a breach must persist before firing
tags: alert labels in key-value pairs
class RuleEvaluator
def load_rules(config_path): Load and parse alert rule groups from a YAML file
def evaluate(rule: AlertRule): Executes the rule’s PromQL against the TSDB
def run_evaluation_loop(interval_seconds): Evaluates loaded rules on every tick

2. Alert Manager

The Alert Manager receives fired alerts from Prometheus Server and handles the full routing, deduplication, and dispatch lifecycle. It stores the state of the alerts, and handles the dispatch to an alert consumer via a Kafka topic.

class Alert
name: e.g., “HighCPUUsage”
tags: e.g. {“severity”: “critical”, “env”: “prod”}
annotations: e.g. {“summary”: “CPU > 90%”}
state: e.g., inactive, pending, firing, resolved
fired_at:
resolved_at:
class AlertReceiver: HTTP handler for Prometheus Server.
def receive(alerts: list[Alert]): Accepts a batch of alerts from Prometheus Server
class AlertStore: DynamoDB client for persisting alert state.
def save(alert: Alert): Persist an alert and its current state to DynamoDB.
def get_all_firing(): Return all alerts currently in “firing” state.
class AlertDispatcher: Kafka client that publishes alert events to the alerts topic.
def dispatch(alert: Alert): Serialize alert and publish to the Kafka alerts topic.
def get_receiver(alert: Alert): Resolve which receiver this alert should be routed to.

Conclusion

In this post, we designed a metrics monitoring & alerting system using the C4 model on IcePanel. We began with the core requirements and modelled the system top-down, starting with the Context layer, followed by Container, Component, and Code.

We covered several architectural decisions worth taking away from this design.

Pull vs push model: We used Prometheus’s pull model to scrape targets rather than receiving pushed metrics. It keeps the collectors in control of what’s monitored without pushing unwanted data.
Consistent hash ring: We used a coordination component (etcd/zookeeper) to distribute scrape targets across a collector pool without duplication.
Data storage system: We covered the access patterns for a write-heavy system and used a TSDB optimised database like InfluxDB.
Rule evaluation vs routing: We defined the responsibility of evaluating rules against the TSDB and firing alerts (Prometheus Server). We designed the Alert Manager as a separate container that receives fired alerts and handles deduplication, routing, and dispatch to downstream systems.

If you’d like to see more design deep dives, check out:

📚 Resources

Modelling Kubernetes Clusters using C4 Model

IcePanel — Mon, 23 Feb 2026 19:32:06 GMT

Kubernetes (K8s) is the de facto standard technology for orchestrating containerised applications at scale. However, visualising it only as a K8s cluster diagram doesn’t express its architecture in a format accessible to all stakeholders. In this post, we’ll talk about K8s, how to visualise it with the C4 model with an example app (ShopFlow), and some tips to get started.

Introduction

Thinking about architecture with the C4 model works great as a technology-independent abstraction. However, software architectures are often visualised as cloud-specific diagrams like AWS/Azure/GCP diagrams or as Kubernetes cluster views. These diagrams are often static and contain all implementation details at a single view. Details like networking, service meshes, and ingress/egress traffic get in the way when trying to understand the high-level components in the architecture, like core business services or datastores. For organisations running on K8s with little-to-no architectural documentation, there is a modelling problem.

The modelling problem with Kubernetes (K8s)

K8s is the dominant open-source platform for orchestrating containerised applications, maintained by the Cloud Native Computing Foundation (CNCF). It has become the most widely adopted technology for running distributed applications at scale.

K8s comes with its own abstractions and ecosystem, which is great for expressing infrastructure as code (IaC) but not for communicating the architecture as a whole. K8s has declarative resources for networking, scheduling, and service discovery. A Deployment describes how your application runs. A Service gives it a stable network identity. An Ingress handles external routing. A ConfigMap externalises configuration. Other K8s resources can be found in the concepts page.

Inside this ecosystem, deployment and networking details matter to the platform team, but not on an architecture diagram. For example, within an engineering organisation, there are generally three personas that interact with software architecture:

Platform Owner: Engineers working on the infrastructure layer that supports running services and workloads. These are typically platform engineers, site reliability engineers, and software architects.
Service Owner: Engineers working on the product itself, owning parts or all of the deployable source code. These are software engineers, machine learning engineers, or product engineers.
Product Owner: Non-technical stakeholders who own the product roadmap and focus on user-facing interactions and feature delivery. These are product managers or designers.

A K8s cluster diagram works well for Platform Owners because they need to see namespaces, networking, and resource configurations. It’s sometimes useful for Service Owners, who care mainly about their own deployments and how they connect to other services. But it’s rarely useful for Product Owners, who need to understand what the system does without going through infrastructure detail.

The C4 model supports all three personas by offering different levels of detail. Level 1 (Context) gives Product Owners the big picture. Level 2 (Container) gives Platform and Service Owners a clear view of deployable units and their relationships. Level 3 (Component) lets Service Owners zoom into the internals of a specific service. Platform Owners can refer to K8s clusters for more networking details.

In general, K8s describes how architecture is deployed and operated while C4 models the logical architecture. They’re complementary views, not competing ones.

Mapping Kubernetes to the C4 model

So how does translating a Kubernetes cluster into the C4 model work?

Cluster = Context diagram.

For starters, the cluster becomes a single Context layer. In this view, the emphasis is on what the system does and who interacts with it (actors and external systems). At a high level, we draw the third-party APIs the cluster depends on and the users who interact with the system (e.g., admins and customers).

K8s namespace is a dotted boundary, not a C4 diagram.

A K8s namespace is a logical grouping of resources within a cluster. It is a useful mechanism for isolating groups of resources together, but not modelled as a first-class element in the C4 model. For C4, it is a visual dotted-line boundary on the Container diagram. It could also be represented as a Group.

Deployments = Container diagrams.

A K8s deployment defines how an application runs and scales. It specifies the number of instances to run, resource limits, and other metadata. In the C4 model, the Container diagram highlights a set of deployable/runnable units that power the system. Each unit is a K8s deployment. Pods are runtime instances that wrap one or more running containers (not to be confused with Container diagrams!). They do not belong to the C4 model since this is already abstracted by the deployment resource. Code inside the pod belongs to the Component diagram, where you show internal structure like REST handlers, API clients, and domain logic.

Services, Ingresses, and ConfigMaps become annotations or arrows.

A K8s service is a networking mechanism that lets one workload talk to another. In C4, that’s just an arrow between two Containers with a protocol label. Other resources like Ingress rules, ConfigMaps, or Secrets are operational details that don’t need their own boxes on a Container diagram.

ShopFlow: Kubernetes Cluster

To see this mapping in practice, let’s take a look at an e-commerce app called ShopFlow. It is a simplified version of Amazon.com where customers browse products, place orders, and pay online. The system runs on a single K8s cluster.

The cluster is organised into three namespaces:

main — the core business services. This is where the actual e-commerce logic lives: a Web App serving the customer-facing storefront, an API Gateway routing requests to backend services, a Product Catalog Service, an Order Service handling checkout flows, a Payment Service integrating with Stripe, and a Notification Service that sends order confirmations via email and SMS. Data is stored in PostgreSQL, and services communicate asynchronously through RabbitMQ.
platform — the core infrastructure. For example, NGINX for traffic routing and Istio for managing the service mesh. These are workloads owned by the platform team, not the software teams.
monitoring — the observability stack. Prometheus and Grafana for metrics and dashboards, Fluentd shipping logs to Loki.

Given this K8s cluster, how should this be modelled using C4 diagrams? For this scope, we’ll only focus on the main namespace.

ShopFlow: C4 Model

See complete design on IcePanel: https://s.icepanel.io/DWnaysJ3cbCQqg/QXrC

Level 1: Context diagram

The Context (system) diagram is for stakeholders who need to understand what the system does and what it depends on. Think product managers, business stakeholders, and new engineers getting onboarded. In this example, we have two external systems that ShopFlow depends on: a payment provider and a notifications provider. There are two main actors: customers and admins.

Level 2: Container diagram

We zoom inside ShopFlow to show its separately deployable units, Containers.

In this Container view, every service is deployed as a K8s Deployment (except datastores and gateways, which can be deployed as managed cloud services like AWS). We have:

ShopFlow WebApp: for showing product information and order history in the UI.
API Gateway: for routing requests to backend.
Product Catalog: for viewing available products and details.
Order Service: for managing incoming orders.
Payment Service: for managing payment flows.
Notification Service: for sending events to customers.
Message Queue: for inter-service communication.
PostgreSQL: for persisting product data and order history.

These deployable units are abstracted within the Container diagram without additional details specific to K8s. Networking details are kept minimal here to reduce cognitive load on viewing these diagrams. In general, every K8s deployment should be represented as a Component inside the Container view.

Level 3: Component diagram

Below is the Component diagram of the “Order Service”. It contains the following modules:

The Order Controller handles incoming HTTP requests. It receives requests from the API Gateway and delegates to the appropriate internal component.
The Order Core validates orders, applies business rules, and coordinates the checkout flow.
The Order Repository handles data persistence to PostgreSQL. It encapsulates data access patterns so the domain logic doesn’t know or care about SQL.
The Event Publisher publishes domain events (e.g., OrderPlaced, OrderCancelled) to RabbitMQ. The Notification Service consumes these events downstream.
The Payment Client calls the Payment Service to charge the customer.

Notice that at Level 3, we’re no longer thinking about K8s. This diagram describes the internal structure of a single deployable unit. You’d draw the same Component diagram whether the Order Service runs on Kubernetes, AWS ECS, or your laptop. That’s what makes C4’s abstraction consistent.

We won’t cover the fourth diagram (Code) in this example, but you get the idea.

Tips for modelling K8s clusters

To get started, here are three tips for modelling K8s clusters with C4.

1. Focus on the high-level design

Avoid going into the networking details in the K8s world. For example, don’t describe which resources can talk to each other with network policies (similar to AWS’ security groups) or how Istio proxies work. Instead, focus on the Deployments and Services in K8s to map out the user flow.

2. Focus on the resources that enable user flow

If you trace a user flow from start to finish, which workloads does it touch? This is the Container view. Everything else is supporting infrastructure that can be mentioned in annotations or separate documentation, not in the C4 model.

3. Use namespaces as visual groups, not architectural boundaries.

Namespace is a K8s organisational tool. Some teams use namespaces per-team, others per-environment, others per-domain. Use the namespace structure on your C4 diagrams if it helps readers, but don’t let it dictate your architecture.

Here’s also a quick guide for mapping K8s to C4:

Conclusion

Working with the C4 model should not reduce the value of K8s diagrams. One communicates the architecture at different layers, the other describes how the architecture is deployed.

C4 works great for team collaboration as a technology-agnostic framework for visualising systems. K8s is great for showing implementation details, not a high-level overview.

📚 Resources

Design ChatGPT using IcePanel

IcePanel — Wed, 28 Jan 2026 19:54:14 GMT

📝 Introduction

In this post, we’ll design ChatGPT, an AI-powered conversational system that allows users to interact with large language models (LLMs) through natural language. We’ll approach this as software architects and create four hierarchical diagrams: Context, Container, Component, and Code (not familiar with these? Read this).

Each diagram will be annotated to explain the key building blocks and responsibilities, and we’ll highlight how users interact with the system by visualising data flow using IcePanel Flows.

You can view the final architecture at this link: https://s.icepanel.io/DWnaysJ3cbCQqg/sOie

🔎 Scope

ChatGPT’s core functionality consists of two requirements:

Users should be able to send prompts and receive streaming responses from LLMs.
Users should be able to save, load, and search through previous chat histories.

For non-functional requirements, the system should be:

Availability >> consistency, expected ~10 million daily active users (DAU).
Highly scalable. Chat users might peak at 10k-20k/sec calls to the model.
Low latency, users should expect first tokens in ~500 ms and complete responses in ~3 seconds.
Rate limiting. Users are allowed to send at most 60 requests/min.

Let’s start with the first diagram, Context.

Level 1 — Context

The Context layer defines the actors and external systems ChatGPT depends on. We have one actor and one external system:

User: End user who chats to ChatGPT through mobile or web.

Identity Provider (Auth0): Provides federated authentication and single sign-on (SSO) capabilities, allowing users to authenticate using their existing social media or enterprise accounts. Our system receives verified identity tokens and uses them to create or authenticate user sessions without handling passwords directly.
Other external systems can be integrated (e.g., Payment Provider like Stripe, but we’ll leave it outside of scope). The more detailed design comes next, the Container diagram.

Level 2 — Container

The Container layer models independently deployable applications, services, and data stores. In this scope, we’ll design ChatGPT using a microservices architecture to independently scale certain workloads and provide high availability and performance (as defined in non-functional requirements).

Our system is composed of the following Containers:

Completion Service (EC2): The core service responsible for taking a conversation history (a list of messages) as input and generating a context-aware AI response.
Worker Pool (ECS — Autoscaling): Pool of inference workers that execute LLM requests through the model proxy based on model type, and stream generated tokens back to the Completion service in real time.
Model Proxy (EC2): An intelligent router for multi-model API endpoints.
Model API (EC2 GPU-optimised): EC2 machines pre-loaded with LLM weights (e.g., GPT-4, GPT-3.5, or other LLMs) and streams generated tokens back to caller.
Conversation Service (EC2): A microservice for managing conversation state and retrieving and searching in chat histories.

In this architecture, we use the following technologies:

AWS API Gateway: A secure entry point for all API requests coming from the frontend. Responsible for routing incoming requests and managing HTTP streaming (SSE) for real-time response delivery.
Rate Limiter (Redis): A fast in-memory cache for tracking per-user rate limits.
Elasticsearch Cluster: A search engine for full-text search and retrieval of documents (chat history in this example).
DynamoDB: A NoSQL document store for persisting chat histories and metadata.

On a high level, the system works as follows.

The user connects to ChatGPT through the web (chatgpt.com) or mobile. They send a prompt to start a new chat with the backend server. Their requests translate to a POST request (/completions/create) with the user’s prompt, selected model, and other metadata in the payload. The backend server receives this request and routes it to the appropriate model infrastructure (GPU servers), which processes the prompt and streams the generated response back to the user. This response flows through the same path in reverse, from the model servers back through the backend API, which formats and delivers the text incrementally to the UI using Server-Sent Events (SSE), allowing the user to see the LLM’s response appear in real-time as it’s being generated. The user prompt can be modelled in a body request like below:

For more examples, check out OpenAI’s API reference: https://platform.openai.com/docs/api-reference/responses/create

We’ve designed two primary data flows using IcePanel. Check out these flows and play them step by step to see how our system works.

Flow 1: Send a prompt to ChatGPT

Complete flow: https://s.icepanel.io/DWnaysJ3cbCQqg/vPHL

On a high level,

User sends prompt via an HTTP POST
API Gateway validates rate limits
Completion Service acquires a worker and starts inference
Tokens stream back to the user via SSE
Response is persisted to DynamoDB

Flow 2: Find older chat and send a new prompt

Complete flow: https://s.icepanel.io/DWnaysJ3cbCQqg/0Oj7

On a high level,

User sends search keywords to find a previous chat.
Conversation Service queries Elasticsearch for full-text search.
Chat history is fetched and displayed to the user.
User sends a new prompt with full conversation context.
(Flow 1) repeats.

HTTP Streaming and Server-Sent Events (SSE)

When a client sends an HTTP POST request to the Completion API endpoint, the server keeps the response open and streams partial results as they are produced. This is implemented using Server-Sent Events on top of HTTP, which uses a special header (Transfer-Encoding: chunked) that tells the client the data will be streamed into a set of chunks. As tokens are generated by the model, they flow immediately through the system (Model API → Worker → Completion Service → Client) and are flushed to the HTTP response stream. This enables low time-to-first-token (hundreds of milliseconds) and significantly improves latency, as users can start reading the response while the model is still generating it. This is how systems like ChatGPT provide a real-time chatting experience to the user (docs).

Quick note: Some system design readers might suggest using WebSockets (WS) instead of SSE when designing ChatGPT, since WS is a bidirectional protocol commonly used in chat systems like WhatsApp. However, WS introduces significant complexity that requires specialised infrastructure to route and maintain millions of concurrent live user connections. For the scope of this design, that complexity isn’t necessary. One-directional HTTP streaming from the backend is good enough to solve this problem.

Level 3 — Component

In the C4 model, a component is a grouping of related modules encapsulated behind a well-defined interface. Let’s look at the Components in ChatGPT.

1. Completion Service

This service orchestrates the complete LLM inference lifecycle. When a user sends a chat completion request, the Completion API first validates rate limits via the Redis Client. Once approved, the Worker Pool API acquires an available worker from the pool based on the requested model type (GPT-4, GPT-3.5, etc.). These tokens are immediately streamed back through the API Gateway to the user, creating the real-time chat experience.

2. Model Proxy Service

The Model Proxy acts as an intelligent routing layer between workers and Model APIs. The Model Router selects the optimal API endpoint by consulting the Health Checker for endpoint status and the Metrics Collector for latency data. It calculates a routing score combining health, latency, and current load to choose the best endpoint. All requests pass through the Circuit Breaker, which monitors failure rates and prevents cascade failures by temporarily blocking requests to unhealthy endpoints. This architecture ensures requests are always routed to the fastest, most reliable Model API instances across multiple regions.

3. Conversation Service

The Conversation Service manages chat history retrieval and search. When a user requests their conversation history, the Conversation API first queries the Elasticsearch Client for fast full-text search across all messages. For direct conversation retrieval, the History Repository fetches data from DynamoDB, which serves as the source of truth. New messages are persisted to DynamoDB and asynchronously synced to Elasticsearch via DynamoDB Streams for eventual consistency (which is acceptable in this case).

Let’s go one level deeper with the source code in the Code layer.

Level 4 — Code

This is where we can view implementation details at the code level. At this level, we focus on code structure rather than diagrams. We’ll briefly go over two Components: Completion and Conversation.

Code Classes in Completion Service

This component orchestrates LLM inference requests by managing the complete lifecycle from rate limit validation to streaming response delivery. The Completion API provides RESTful endpoints for initiating chat completions. The RedisClient validates user rate limits by checking global user quotas (60 requests per minute) stored in Redis before allowing requests to proceed. The WorkerPoolAPI manages the pool of available workers, handles load balancing, and acquires connections for inference requests.

Completion API

RESTful endpoints for chat completion requests

POST /v1/chat/completions — Initiate a new chat completion
POST /v1/chat/completions/:requestId/cancel — Cancel an ongoing completion request

Redis Client

Manages per-user rate limiting and validation

isAllowed(userId) — Validates if user is within QPS and concurrent generation limits
updateUsage(userId, tokenCount) — Records user request and token usage

Worker Pool API

Manages worker pool and connection lifecycle

acquire(modelType, priority): Worker — Acquires available worker from pool based on model requirements
release(workerId) — Returns worker to available pool after request completion

Code Classes in Conversation Service

This component manages conversation state and retrieves chat history for LLM context. The Conversation API class exposes RESTful endpoints for fetching user conversations and chat history. The API first queries Elasticsearch for fast full-text search across conversation histories before falling back to the History Repository for direct database access. The Elasticsearch client handles all interactions with the Elasticsearch cluster, providing efficient search capabilities. The History Repository serves as the data layer for fetching conversation messages and metadata from the History Database (DynamoDB).

These classes have roughly the following methods:

Conversation API

RESTful endpoints for conversation and chat history management

GET /v1/conversations/:conversationId — Retrieve conversation details and complete message history.
GET /v1/conversations/:conversationId/messages?limit&cursor — Retrieve paginated messages for a conversation.
POST /v1/conversations/search — Search for specific keywords in past conversations

Elasticsearch Client

Manages Elasticsearch interactions for fast conversation search

search(userId, query, filters) — Full-text search across all user conversations
getMessages(conversationId, limit, cursor) — Retrieves messages from Elasticsearch index

History Repository

Data access layer for DynamoDB conversation storage

getConversation(id) — Fetches chat history given conversationId
appendMessage(conversationId, role, content, tokens) — Saves message given conversationId

Conclusion

In this post, we designed an AI chat system like ChatGPT using the C4 model on IcePanel. We began with the core requirements and modeled the system from the top down, starting with the Context layer, followed by the Container, Component, and finally the Code layer.

If you’d like to see more design deep dives, check out:

Let us know which system you’d like to see modeled next on IcePanel!

📚 Resources

State of Software Architecture Report — 2025

IcePanel — Wed, 21 Jan 2026 19:50:55 GMT

State of Software Architecture Report — 2025

🤔 Why did we do this?

Every year, we ask the community to share their thoughts and opinions shaping software architecture. Check out the 2024 results here. Looking back, 2025 was a year of rapid change and continued AI adoption. Interest in AI continued to diffuse, extending into more industries and more complex workflows. How do architects feel about all of this? How has it impacted their day-to-day? We were curious to find out.

We’re excited to share the results from our 2025 survey. Let’s dive in!

🔑 Key highlights

Read this section if you want the TL;DR, or scroll down and look for a detailed breakdown of each section.

75 people responded, with 57% being architects and 29% engineers/developers.
Most respondents were experienced professionals, with 95% being full-time employees. 68% had 6+ years of experience.
Keeping documentation up to date was overwhelmingly the biggest challenge, followed by a lack of standards and finding the right level of detail.
Diagramming tools (87%) and collaborative wikis (79%) were the most popular tools for software architecture. 44% now use AI/LLMs for documentation.
Source of truth shifted remained in collaborative wikis (42%) and dedicated tools (24%).
More than 70% were at least moderately confident using the C4 model. Context diagrams (81%) and Container diagrams (79%) were most commonly used.
AI usage is mostly experimental: 37% use it in some workflows, 33% are exploring, 19% haven’t started yet. Main uses include diagram generation, docs creation, and design validation.
Architects see their role evolving to be more strategic and quality-focused, with AI augmenting rather than replacing their work.

👥 Who answered?

There were a total of 75 responses.
More than half (57%) of the respondents were architects, with solution architects being the most common at 29%. Engineers/developers were the second largest group at 29%.
The majority of respondents were experienced full-time employees (95%), with 6–10 years (29%) and more than 10 years of experience (39%). More tenured respondents responded this year (6+ years of experience).*
There was a wide range of company sizes, with 100–499 (26%) and 5,000 or more (21%) being the most common.
Despite the wide range of company sizes, most had a small team of dedicated architects. The most common team size was 1–9 architects, at 51%.
Difference not statistically significant based on chi-square test.

🤬 Biggest challenge related to architecture

Keeping documentation up to date: Overwhelmingly, the most common challenge mentioned. People struggle with architectural drift, maintaining the current state in an agile process, and often end up with out-of-date docs that people lose trust in.
Lack of standards and consistency: No agreed-upon standard or conventions established. Teams have different interpretations of the C4 model, conflicting views of the system in the org, and scattered tooling with no central source of truth.
Finding the right level of detail: Dealing with the tension of too much detail and too little. Understanding “how to deep go” when diagramming so it’s valuable, and communicating to different audiences (developers to C-suite).
Time and resource constraints: Lack of time or the inability to prioritize documentation work. Docs are often seen as a burden rather than a value-add. Justifying the quality of docs vs the speed of delivery.

⚒️ Tooling and practices

This section focuses on specific tooling and practices for creating and maintaining software architecture documentation.

The majority of respondents used Diagramming tools (87%), followed by collaborative wiki tools (79%), to document software architecture.
65% mentioned using a modelling tool, almost identical to 2024 at 64%.
44% used AI/LLMs and diagrams-as-code. These were new options this year.
Physical whiteboarding was still going strong at 40%.

Source of truth of architecture continued to converge in 2 places:

42% said it was on a collaborative wiki tool, while 24% said they used a dedicated tool.
This is a reverse of 2024, where collaborative wiki (29%) was above a dedicated tool (37%).*
19% had no single source of truth, and 11% tracked it with code or informally. Both were up from 2024.*

* Both differences were not statistically significant.

There was no pattern for how often people updated their architecture documents. It was a close split between monthly (26%), annually (23%), quarterly (23%), and weekly (23%).
It was a close, even split on using architecture decision records (ADRs). 48% said they use ADRs, and 52% don’t.

The most common architecture patterns used were microservices (60%) and event-driven (55%). Both of these are down from 2024; however, the differences were not statistically significant.

4️⃣ C4 model

This section focuses on the adoption of the C4 model and its challenges.

Most people were moderately (29%) or very confident (41%) in using the C4 model.
There was a slight increase in overall confidence from 2024, but the difference was not statistically significant.
We asked respondents who were not confident or only slightly confident to explain in more detail why. Some example responses were: "Lack of interest in adopting it," or "haven’t spent time learning it yet."

Context diagrams were the most commonly used diagram type in the C4 model at 81%, followed by Container/app (79%), and Component diagrams (41%).
Context increased, while container and component decreased from 2024. The increase in context diagram usage was statistically significant (p=0.05/7).
Among supplementary C4 diagrams, most people used dynamic diagrams/IcePanel Flows (28%) or system landscape diagrams (27%).

🔮 AI/LLMs and the future of software architects

The final section focuses on how AI/LLMs are used and how they impact software architects. These were mostly open-ended questions.

Adoption of AI/LLMs continues to be a work in progress. Most teams are still experimenting:

37% said it’s used in some aspects of their workflow/tooling.
33% have dabbled in it, but it’s been mostly exploratory.
19% have yet to explore it.

In general, people are still in the experimental and early stages of using AI. Many have yet to use it at all and are cautious about fully trusting its output. With that said, a general theme we saw was using AI/LLMs as assistants for brainstorming and gut-checks.

⭐ Top ways people are using AI/LLMs

Diagram generation: Generating Mermaid diagrams, going from code to diagrams (reverse engineering), automating diagram updates with code changes, and creating a DSL for diagramming tools.
Creating and summarizing docs: Creating ADRs, translating text for different audiences (technical to business), and summarizing architectures.
Design validation: Using AI as a thinking partner. “Rubber ducking” designs, sanity checking decisions or asking for a second opinion, exploring potential designs and tradeoffs, evaluating architecture decisions.
Research and ideation: Brainstorming new ideas and changes, researching different patterns and concepts, asking for suggestions on what could be improved. People using it as a ‘consultancy’.

People see the role of architects evolving into something more strategic and quality-focused, shifting from a creator to a coach/facilitator. In general, people see AI as an augmentation, rather than a replacement, helping them be more efficient by removing a lot of mundane tasks, such as maintaining docs. AI will help architects focus on designing resilient, scalable systems that align with business objectives, rather than lower-level technical decisions.

🧊 That’s a wrap!

Another survey in books. Thanks so much to everyone who participated! If you’ve got any thoughts on the results or ideas for what we should focus on next time, let us know — mail@icepanel.io. We’d love to hear from you!

Stay Chill 🤙

Why engineers should use architecture diagrams

IcePanel — Mon, 12 Jan 2026 16:23:05 GMT

Introduction

Quality and speed of shipping projects are two core metrics for evaluating performance of engineering teams. Projects typically go through multiple phases like research, requirements gathering, solution design, and implementation, each of which depends on a clear understanding of teams’ software architecture. Engineers who use architecture diagrams can accelerate most of these phases (if not all) when applied effectively.

Regardless of the diagram choice, C4 Model, UML diagrams (Sequence, Activity), or cloud-specific like AWS diagrams, the goal is the same: to build shared architectural knowledge and avoid accidental complexity. This makes collaboration easier for teams when they start referencing these diagrams during whiteboarding sessions, design reviews, and documentation.

Here are four reasons engineers should use architecture diagrams:

Establish a common baseline for discussing systems
Reduce review time and cognitive load
Enable asynchronous communication
Build a documentation-first culture

Let’s go through each one.

1. Common baseline for discussing systems

Engineers are frequently asked to weigh in on product ideas and feature proposals from upstream management. Some proposals are application-specific, requiring developers to scope and implement solutions programmatically, while others are more infrastructure-specific, requiring architects to consider the knock-on effects of the architectural changes.

Architecture diagrams serve as living documentation of their systems. When kept up to date, engineers can discuss new ideas with their trade-offs and make key decisions to confidently go from design to implementation. On the flip side, introducing changes without sufficient context can create a functional risk (i.e., break systems) and/or data vulnerabilities. For example:

Deploying a message queue (SQS) in a public subnet with disabled encryption (data risk)
Adding a cache in front of a service (inconsistent data risk)
Deprecating a microservice with downstream dependencies (potential cascade failures)

All of which can be avoided if all engineers share the same knowledge baseline for discussing systems.

Generally speaking, architectural diagrams typically capture three elements: (1) data flow, (2) data storage, and (3) data processing. These elements help reviewers understand how a proposed change fits into the system and its potential downstream impact. For example, when introducing a new microservice, reviewers might ask:

Data flow: What are its fan-in and fan-out dependencies?
Storage: Is the service stateless or stateful, and where is its data stored?
Processing: What does the service produce, and how is its data modeled?

Sharing a common understanding of how and why the architecture is designed makes collaboration as an engineer much easier.

2. Reduce review time and cognitive load

This can be helpful in code reviews. Referencing diagrams in a pull request (PRs) shows that the author has considered the architectural implications of their changes. This provides reviewers with context to:

Explain the expected behaviour when the change is deployed (e.g., extra SQL queries to the core database, potentially added latency to the service).
Ensure the author has thought about higher-level system interactions (e.g., number of connections to the core database).

Reviewers can then evaluate changes without mentally reconstructing the entire system. This reduces back-and-forth questions in PRs like “How does this affect X?” or “Have you thought about Y?”.

Architecture diagrams are particularly important for platform teams managing infrastructure-as-code (IaaC) repos. Diagrams are critical in these scenarios because code/configuration changes often alter the current architecture, using tools like Terraform or CloudFormation. For example:

Deploying a new application within a private subnet.
Provisioning a new Kafka with new consumers.
Configuring apps to call a new third-party API.

These examples highlight the importance of (A) a shared architectural baseline among engineers, reducing the need for repeated explanations (especially in PRs), and (B) minimising cognitive load by referencing diagrams instead of re-drawing parts of the architecture.

3. Encourage asynchronous communication

One common mistake in team collaboration is relying on a single engineer who holds architectural knowledge, repeatedly asking them questions via Slack or 1:1 meetings. This synchronous communication creates a knowledge bottleneck and a single point of failure, especially when that engineer is on PTO or leaves the company.

A better approach is documenting infrastructure knowledge in ADRs (Architecture Decision Records) and architecture diagrams. Engineers can revisit these documents at any time, leave comments or ask questions asynchronously, and not be blocked by a single person. This empowers teams to make design decisions independently and reduces dependency on individual engineers.

4. Documentation-first culture

Encouraging engineers to create and maintain architectural diagrams fosters a documentation-first culture. It makes it easier to search for answers, and, in some setups, allows engineers to query systems that support MCP servers (for example, check out IcePanel MCP). A documentation-first culture begins with writing product requirements, ADRs, engineering proposals, and diagramming systems. This approach is similar to Amazon’s 6-pager culture, where new ideas or projects are discussed in short, structured documents. If a proposal isn’t documented, it doesn’t enter discussion. Architecture diagrams serve as valuable references in these 1-pagers and 6-pagers, reducing back-and-forth clarification about the current system and keeping discussions focused on the post-delivery.

Here’s a summary below of Amazon’s writing culture.

Conclusion

Shipping engineering projects requires a clear understanding of a team’s software architecture. Documenting architectures is an indicator of strong engineering collaboration and a documentation-driven culture. It empowers engineers to move quickly from feature idea to implementation, and discuss design trade-offs with a shared understanding of their systems.

Teams that invest time in having architecture diagrams reduce cognitive load, onboard engineers faster, and prevent accidental complexity when making changes. Over time, this leads to more reliable systems and stronger engineering-product collaboration.

To recap, architecture diagrams help engineers:

Establish a common baseline for discussing systems
Reduce review time and cognitive load
Enable asynchronous communication
Build a documentation-first culture

📚 Resources

Design eBay using IcePanel

IcePanel — Thu, 18 Dec 2025 23:32:03 GMT

📝 Introduction

In this post, we’ll share an example architecture for eBay — one of the largest online auction and marketplace platforms where users can buy and sell items or make direct purchases. We’ll design this as software architects and create four hierarchical diagrams: Context, Container, Component, and Code (not sure what these are? Read this). We’ll annotate the building blocks and highlight user interaction (data flow) within the system using IcePanel Flows.

Let’s first define the scope by writing the functional and non-functional requirements of the system. Afterwards, we’ll go through each layer of the C4 model.

You can view the final architecture at this link: https://s.icepanel.io/DWnaysJ3cbCQqg/eYV5

🔎 Scope

For an online auction like eBay, users should be able to:

Sell an item or post it for auction with a starting price.
Buy an item or place a bid with a price higher than the current bid.
View an auction for an item with the current highest bid.

For non-functional requirements, the system should have:

Strong consistency with bidding to make sure all buyers can see the same price.
Real-time notifications and scalable to ~1 million concurrent auctions.
Fault-tolerant and durable (i.e., we can’t drop any bids from buyers)

Let’s start with the first diagram, Context.

Level 1 — Context

This context view shows the interaction between end users, our main system (eBay), and the external systems it integrates with. In this design, we have two actors:

Seller: User who sells items on eBay in an auction or with a fixed price.
Bidder: User who buys items or places bids on eBay.

Ebay also depends on two external systems:

Payment Provider (Stripe): Responsible for processing payment transactions and managing the complete payment lifecycle. When a winning bidder needs to complete payment or when a seller receives funds, they are redirected to the payment provider’s secure interface, where card details are collected and processed.
SMS Provider (Twilio): Manages fast SMS delivery to users, mainly bidders. Our system publishes new bid events to a Kafka topic, where our notification service consumes that message and then forwards it to the SMS provider for notifying users.

Level 2 — Container

This diagram is where we model a collection of independently deployable or runnable applications or data stores that are essential for the overall software system to function. This could be a web application, server, datastore, or a streaming data platform like Apache Kafka.

This system is an event-based architecture. This approach is ideal for high-throughput bidding scenarios where services process messages asynchronously whenever an item receives a bid. Events are triggered using Kafka topics, allowing the system to handle traffic spikes gracefully while maintaining strong consistency guarantees.

Our system is composed of the following Containers:

Auction Service: Manages the complete auction lifecycle, including creating new auctions when sellers list items, reading auction details and current bid status, and updating auction metadata (start time, end time, status).
Bid Producer: Responsible for accepting and validating incoming bids. It validates bid amounts with atomic checks on Redis, publishes bids to Kafka for asynchronous processing, and atomically updates the bid cache.
Bid Service: Consumes bid events from Kafka and provides durable persistence. It consumes bid events from a Kafka topic, persists all bids to the Auction Database (bids table) for audit trail, and updates the auction’s current price in the database.
Notification Service: Manages all user communications. It consumes bid events and auction lifecycle events from Kafka, sends real-time notifications when users are outbid, and sends winning bid confirmations.

Note: Real-time solutions like WebSocket connections for live auction updates and publish/subscribe patterns are outside the scope of this design. This architecture focuses on the core bidding flow, auction management, and durable event processing.

In this architecture, we use the following technologies:

AWS API Gateway: A scalable and secure entry point for all API requests coming from the frontend. This object is represented using connection via property.
AWS EC2: Web service that provides reliable and secure compute for the different microservices (Auction Service, Bid Producer, Bid Service, Notification Service, Payment Service).
Apache Kafka: A distributed event streaming platform that provides high throughput (millions of messages per second), durability (persists all messages to disk with replication), and ordering guarantees (partitioning by auction_id ensures sequential processing). Perfect for handling high-volume bidding periods while ensuring no bids are lost. This object is represented using a connection via property.
PostgreSQL: An ACID-compliant relational database that ensures data integrity and strong consistency. Ideal for structured data like auctions and bids tables where transactional guarantees are critical.
Redis: A fast in-memory cache that improves response times (sub-millisecond latency) and reduces database load. Essential for serving current bid prices to 1 million concurrent auctions.

We’ve designed two common data flows using IcePanel. Check out these flows and play them step by step to see how our system works:

Create a new auction listing: https://s.icepanel.io/DWnaysJ3cbCQqg/SpZW
Place a bid on an auction: https://s.icepanel.io/DWnaysJ3cbCQqg/rmpv

Level 3 — Component

In the C4 model, a component is a grouping of related functionality encapsulated behind a well-defined interface. For example, a collection of classes behind an interface. Let’s look at the Components in eBay’s auction system.

1. Bid Producer

The diagram below shows how a user interacts with the Bid Producer to place a bid on an auction. When a user sends a POST /bid request, the Bid API first validates the request and extracts the auction_id and bid_amount. It then retrieves the current highest bid from the cache (using optimistic locking) and checks if the new bid amount exceeds it. If the bid is too low, an error is returned to the bidder. This prevents race conditions when multiple bidders submit bids simultaneously.

Once the Redis lock is acquired, the Bid Kafka Publisher sends a new bid event to the bid-events topic in Kafka. The bid event includes auction_id, user_id, bid_amount, and timestamp. After publishing to Kafka, the service atomically updates the current highest bid in Redis (auction:{id}:price) before releasing the lock.

2. Auction Service

The Auction Service manages the complete auction lifecycle and provides read/write access to auction data. When a user requests auction information via GET /auctions/:auctionId, the Auction API first checks the Auction Cache (Redis) to retrieve hot auction data with sub-millisecond latency. If the data isn’t cached, the Auction Controller queries the Auction Repository, which fetches auction details from the Auction Database (PostgreSQL).

The Redis Client manages all cache operations, storing frequently accessed auction data with time-to-live (TTL) values to reduce database load. The Auction Repository handles all database interactions, including creating new auction listings when sellers post items, updating auction metadata (start time, end time, reserve price), and querying active auctions with filters.

When an auction ends, the Auction Controller determines the winning bidder by querying the highest bid from the database. It then triggers the Auction Payment Controller, which initiates the payment flow by calling the Payment Service. The Payment Controller handles payment confirmations, failures, and refund processing for cancelled auctions.

3. Bid Service

The Bid Service handles all bid processing in our online auction system. When a bid event is published to Kafka, the Bid Consumer polls messages from the bid topic and processes each bid event asynchronously. It interfaces with the Bid Repository, which serves as the data access layer for storing and retrieving bid information. The Bid Repository executes SQL queries against the Auction Database (PostgreSQL) to persist bids into the database and maintain bid history.

Let’s go one level deeper with the source code in the Code layer.

Level 4 — Code

This is where we can view implementation details at the code level. We don’t recommend creating extensive diagrams for this; instead, we link directly to the code. However, here’s the general structure of these components for reference. We’ll briefly go over two Components: Auction and Bid Producer.

Classes in “Auction Service”

This component manages the auction lifecycle and real-time auction data access. The Auction API class exposes RESTful endpoints for retrieving auction details, current high bids, and auction status. The API first checks Redis cache for hot auction data before querying the database. The Auction Controller coordinates between cache, repository, and payment processing. The Auction Repository serves as the data layer for fetching auction information from the Auction Database.

Auction Controller

GET /auctions/:auctionId — Retrieve auction details (item, description, current bid, etc.)
GET /auctions/:auctionId/bids — Get bid history for specific auction
POST /auctions — Create a new auction listing
PATCH /auctions/:auctionId — Update auction details
DELETE /auctions/:auctionId — Cancel auction

Redis Client

getCachedAuction(auctionId) — Retrieve auction data and current high bid from cache
cacheAuction(auctionId, auctionData, ttl) — Store auction data with time-to-live

Auction Repository

getAuctionById(auctionId) — Fetch complete auction details from database
getCurrentHighBid(auctionId) — Retrieve current winning bid amount and bidder
updateAuctionStatus(auctionId, status) — Change auction state (active, ended, cancelled)
getActiveAuctions(filters) — Query active auctions with filtering and pagination

Auction Payment Controller

initiatePayment(auctionId, winnerId) — Start payment process for auction winner
refundBid(bidId) — Process refunds for cancelled auctions or failed transactions

Classes in “Bid Producer”

This component handles the bid submission workflow with strong consistency guarantees. The Bid API class provides RESTful endpoints for users to place bids and check bid status. The Bid Validator ensures bid amounts meet minimum requirements and auction eligibility rules. The Optimistic Lock Manager implements Redis-based locking to prevent race conditions during concurrent bidding. The Kafka Producer handles asynchronous event publishing for downstream bid processing and notifications.

These classes have roughly the following methods:

Bid API

POST /bids — Place a new bid on an active auction
GET /bids/:bidId — Retrieve details of a specific bid
GET /auctions/:auctionId/bids — Get all bids for an auction

Cache Client

getCurrentBid(auctionId) — Retrieve current bid for specific auction
updateCurrentBid(auctionId, bidAmount) — Update cached high bid in real-time

Kafka Producer

publishBidPlacedEvent(bidData) — Send bid creation event to Kafka topic
publishBidOutbidEvent(userId, auctionId) — Notify when user is outbid

Conclusion

In this post, we designed an online auction platform (eBay) using the C4 model on IcePanel. We began with the core requirements like strong consistency, real-time updates, and fault tolerance, and modeled the system from the top down, starting with the Context layer, followed by the Container, Component, and finally the Code layer.

If you enjoyed this post, check out our previous design deep dives using IcePanel:

📚 Resources

What’s the purpose of software architecture diagramming?

IcePanel — Tue, 09 Dec 2025 19:20:37 GMT

Introduction

The concept of software architecture started in the 1960s, during the time when the idea of object-oriented programming (OOP) was introduced. Before that, programming was low-level and written in assembly language. In the 1970s, higher-level programming like C/C++ started to gain popularity, programmers began theorising about how to structure software, exploring ideas around modularity and introducing different architectural styles. Some of the most well-known architectures include monolithic, microkernel, and microservices.

Diagramming became an essential engineering practice for designing software. It started with a modelling language called Unified Modelling Language (UML) for object-oriented programming, then evolved to include more flexible approaches, like ArchiMate for enterprise architecture, and C4 Model for modern software systems. Teams in non-enterprise environments often default to pragmatic approaches for communicating architecture, using simple boxes and arrows as the basis for diagramming.

Purpose of software architecture

Some engineers perceive architectural diagramming as a secondary piece of documentation that they optionally produce and the intended audience is their teams only. This is a common misconception that undervalues the real purpose of software architecture. In reality, architectural diagramming is:

Not secondary: Designing architecture is as important as writing code that runs the system.
Not optional: it should be defined before building a new system or implementing a feature as it clarifies dependencies and interactions.
Not only: Architecture diagrams also serve the wider organisation and other stakeholders by documenting the key architectural decisions to maintain the system.

Martin Fowler, Chief Scientist at Thoughtworks, is one of the most influential people in software architecture. He emphasised the importance of software design through his books like Patterns of Enterprise Application Architecture and Refactoring. In a well-known email exchange with Ralph Johnson, he defined architecture as the shared understanding that the expert developers have of the system design. Check out his 15-minute talk on YouTube where he explains what architecture is and why it matters.

Fundamentally, software architecture defines the organisation of a system: how components communicate and, when grouped together, ultimately deliver value to users. Think of it as the difference between a booking system (architecture) versus an individual payment (component within architecture). Here is an example of an AWS architecture below:

To document software architectures, diagramming is an important exercise for visually representing component relationships and communicating architectural decisions clearly. That’s why system design interviews are widely used to assess a candidate’s ability to think architecturally and communicate trade-offs when designing scalable systems.

Common Mistakes

Some mistakes people make when designing software architectures are:

1. Focusing on too much detail.

Some engineers tend to approach diagramming as a way to enumerate every component and all of its edge cases. For example, say there is a web service that calls multiple endpoints to complete a specific flow (e.g., a checkout flow). Some engineers will want to highlight this flow in their architecture with all the edge cases to make it comprehensive. However, that is not the goal of the architecture, that belongs in an activity diagram. The purpose of a software architecture diagram is to highlight the current state of the system and the key decisions behind the design.

2. Serving all stakeholders with a single diagram.

Each type of audience has a requirement from the architectural diagrams. Product and design want a brief, high-level understanding of how the major pieces fit together. Engineers want to see more than high-level, they want to understand the technical details and decisions made behind each component. For example, product managers might want to understand the mobile app integration project through the Context layer or Container layer of the C4 model, while engineers need more details like the authentication flow in the Component layer.

Even with AWS diagrams, architecture can be split into layers of abstraction similar to the C4 model. Product teams can view high-level diagrams that explain the customer onboarding process for their new SaaS product (e.g., user dashboard, analytics, or named business services), while engineers can reference low-level architecture diagrams that focus on networking components (such as VPCs, subnets, and data access patterns) and services (such as EC2 and ECS) that run the infrastructure.

A single diagram cannot effectively serve all audiences, it will either be too simplistic or too complex for at least one group.

3. Mixing levels of abstraction.

Mixing levels of abstraction is a common trap in architecture diagramming. Some engineers tend to show high-level components like web services or datastores alongside lower-level details like OS processes or network connections, often making it unclear what the viewer should focus on or what is important. A better approach is to be intentional about the level of detail you’re presenting. High-level diagrams should communicate the overall system structure and architectural decisions, while separate, detailed diagrams can expand on specific modules or flows. This way, each diagram has a clear purpose and is easier for its intended audience to understand. A good way to avoid mixing abstractions is to follow a structured approach like the C4 Model.

Future of software architecture with AI

TL;DR: a “vibe-diagramming” phase will soon disrupt the industry, but the need for software architects will still exist.

Current LLMs like GPT-5.1, Gemini 3 Pro, Claude Sonnet 4.5 are the state-of-the-art models. They excel at generating production-quality code, interpreting large contexts (e.g., monorepos), and building deep understanding and reasoning within enterprise systems. However, they still struggle with nuanced design and architectural decision-making (try to ask it to design a Ticketmaster). For now, LLMs can generate simple language-based diagrams like Mermaid.js and help interpret existing architectures, but designing and maintaining the overall architecture of a complex system still requires human expertise.

The past few years have been a disruptive shift to the software engineering industry, with predictions that AI would replace software developers entirely. That hasn’t happened. Human involvement remains essential for strategic thinking and problem-solving. We believe the same pattern will apply for software architects.

AI will eventually transform the software architecture domain, but the fundamentals won’t change. Even as LLMs improve at diagramming, software architects will still be needed for understanding business constraints, making trade-offs, and aligning technical decisions with business goals. The future with AI will be an accelerator for architects, not a replacement.

For interpretation, you can explore architecture diagrams by querying an MCP server (e.g., see IcePanel’s MCP). However, designing or maintaining architectures with AI is still in its early stage. Software architects can use AI tools to interpret architecture, but not yet to create it.

Conclusion

Software architecture diagramming has evolved from early theories of software design in the 1960s and formal modeling languages like UML to the flexible, pragmatic approaches that serve modern development teams today. The core purpose remains unchanged: to visually communicate how systems are organised, document key architectural decisions, and create shared understanding across teams.

While we’ve discussed common mistakes to avoid and good practices to follow, the landscape is shifting as AI tools begin to enter the architectural workflow. The question isn’t whether AI will impact how architects diagram and design systems, but rather how architects can best leverage these emerging capabilities to enhance their work.

As AI tools continue to advance, they’ll augment architects’ capabilities in interpreting and generating diagrams, but human involvement will remain necessary. The future belongs to architects who leverage AI as a design assistant while maintaining their role as strategic technical leaders who understand business constraints, evaluate trade-offs, and align technical decisions with organisational goals.

If you enjoyed this piece, check out our related post on how AI is redefining the role of a software architect: https://icepanel.io/blog/2025-07-21-how-ai-is-redefining-the-role-of-a-software-architect

📚 Resources

The best alternatives to Miro for software architecture diagrams

IcePanel — Tue, 25 Nov 2025 18:25:18 GMT

1. LucidChart

Started in 2008
Similar to Draw.io
Pricing: Free and paid plans (starting at $12/mo/user for teams)
Best for medium-sized technical teams

What is LucidChart?

LucidChart is a general-purpose diagramming tool to create everything from flowcharts, org charts, and architecture diagrams. Easy to use, collaborative, and capable of doing many different things. If you’re coming from Miro, you’ll find the LucidChart experience similar, perhaps slightly more difficult in terms of usability.

Main features

👍 Easy to use with lots of customization for shapes and lines
📊 Supports many different diagram types (UML, C4 model, mind maps, flowcharts)
🫂 Real-time collaboration and integrations with a variety of tools (Atlassian, Teams, Slack)
⏬ Import from Draw.io, Visio, Gliffy, and other similar tools

How does LucidChart compare to Miro?

Pricing

LucidChart: Free and paid plans for teams starting at $12/mo/user (minimum 3 seats on Team plan).
Miro: Free and paid plans for teams starting at $10/user/mo.

Shape and Templates Library

LucidChart: Premium shape & template library on Team plan; custom shapes available.
Miro: Premium shape & template library on Business plan; custom shapes available.

Integrations

LucidChart: Integrates with Atlassian (Confluence, JIRA), Google Drive, Teams, Slack, and more.
Miro: Integrates with Atlassian, Google, Microsoft 365, and more.

UI/UX

LucidChart: Simple UI, but slightly harder to use than Miro.
Miro: Easy-to-use drag-and-drop UI.

Collaboration

LucidChart: Real-time collaboration with commenting.
Miro: Real-time collaboration with commenting.

AI Features

LucidChart: Generate, iterate, and summarize diagrams with AI.
Miro: Generate, iterate, and edit text with AI; agents with “sidekicks.”

Enterprise

LucidChart: Enterprise plan available.
Miro: Enterprise plan available.

Best for: Teams looking for a general diagramming tool that can do a bit of everything. Ideal if software architecture diagrams are ephemeral and don’t need to exist for long-term use.

2. Draw.io (diagrams.net)

Started in 2000
Similar to LucidChart
Pricing: Free (open-source), but paid on Atlassian after 10+ users
Best for small teams on the free option. Good for large teams with Atlassian

What is Draw.io?

Draw.io is an open-source diagramming tool with a drag-and-drop UI and large shape library, allowing you to create different types of architecture diagrams. The company’s mission is to “provide free, high quality diagramming software for everyone.” Using the open-source version allows you to freely create, edit, and save diagrams in your preferred workspace. For team’s already using Atlassian products (Confluence), you can also download/purchase the Draw app from their marketplace and use it natively.

Main features

🖌️ Intuitive drag-and-drop editor with customization options.
📦 Flexible diagram storage. Diagrams can be saved in Google Drive, Microsoft OneDrive, or locally on the desktop app.
👥 Real-time collaboration with commenting.
🤝 Atlassian (Confluence and JIRA) integration. Work directly in the tool without leaving.
💰 Free and paid options with enterprise-level security

How does Draw.io compare to Miro?

Pricing

Draw.io: Generous free option with paid plans in the Atlassian marketplace.
Miro: Free and paid plans for teams starting at $10/user/mo.

Shape and Templates Library

Draw.io: Large shape library for architecture diagrams (UML, C4, etc.).
Miro: Premium shape and template library on Business plan; custom shapes available.

Integrations

Draw.io: Integrates with Google Drive, OneDrive, GitHub, Confluence, and Notion.
Miro: Integrates with Atlassian, Google, Microsoft 365, and more.

UI/UX

Draw.io: Simple and easy-to-use UI.
Miro: Easy drag-and-drop UI.

Collaboration

Draw.io: Real-time collaboration with commenting via Google Drive, OneDrive, or Atlassian.
Miro: Real-time collaboration with commenting.

AI Features

Draw.io: AI-driven smart templates and generated diagrams, more basic.
Miro: Generate, iterate, and edit text with AI; agents with “sidekicks.”

Enterprise

Draw.io: Enterprise security available via Atlassian standards.
Miro: Enterprise plan available.

Best for: Teams wanting a free, versatile diagramming tool that can be saved in their existing workspace. Also great for teams working in Confluence or JIRA who want something that feels native.

3. IcePanel

Started in 2021
Similar to LucidChart and Miro
Pricing: Free and paid options (Starting at $40/mo/editor annually)
Best for medium to large teams

What is IcePanel?

IcePanel is a collaborative diagramming and modelling tool based on the C4 model. It allows you to create hierarchical diagrams while maintaining a single source of truth in a model. It’s a lightweight tool that helps teams design software architecture at scale with structure and consistency. Compared to Miro and other diagramming tools, it’s mainly built for software architecture design and docs.

Main features

🔢 C4 model diagrams (Level 1 to Level 3)
🔀 Communicate user journeys with Flows and Tags
🧱 Maintain and reuse objects with a model
💡 Collaborate on ideas in Drafts and track changes with Versions
🤖 Export and connect model with LLMs with MCP and LLMs.txt exports

How does IcePanel compare to Miro?

Pricing

IcePanel: Free and paid plans for teams starting at $40/mo/user annually.
Miro: Free and paid plans for teams starting at $10/user/mo.

Model-based

IcePanel: Yes — provides a single source of truth.
Miro: No.

Architecture Design

IcePanel: Create Flows to communicate data flows and user journeys. Layered diagrams for consistent structure. Draft future-state views of architecture.
Miro: None.

Integrations

IcePanel: Embed diagrams in Confluence, Notion, Miro, SharePoint via iFrame.
Miro: Integrates with Atlassian, Google, Microsoft 365, and more.

UI/UX

IcePanel: Easy-to-use drag-and-drop UI.
Miro: Easy-to-use drag-and-drop UI.

Collaboration

IcePanel: Real-time collaboration with commenting.
Miro: Real-time collaboration with commenting.

AI Features

IcePanel: Export diagrams as LLMs.txt and MCP integration.
Miro: Generate, iterate, and edit text with AI; agents with “sidekicks.”

Enterprise

IcePanel: Enterprise-level security available on Scale plan.
Miro: Enterprise plan available.

Best for: Teams that want to consistently design and document their software architecture without the complexity of learning a new syntax.

4. Eraser

Started in 2020
Similar to Draw.io
Pricing: Free and paid plans (starting at $12/mo/user monthly)
Best for small to medium-sized teams

What is Eraser?

Eraser is an AI-based diagramming and docs tool for technical teams. Eraser lets you generate diagrams from prompts, connect diagrams to code to keep them in sync, and create docs in context with diagrams.

Main features

👍 Drag-and-drop UI and diagrams-as-code support
💼 Focused on technical design and docs
🤖 AI features to generate and iterate on designs
🫂 Real-time collaboration and commenting
🔁 Integrations with GitHub, Notion, Confluence, and VS code extension

How does Eraser compare to Miro?

Pricing

Eraser: Free and paid plans for teams starting at $12/user/mo.
Miro: Free and paid plans for teams starting at $10/user/mo.

Shape and Templates Library

Eraser: Core shapes and architecture templates; custom shapes available on paid plans.
Miro: Premium shape and template library on Business plan; custom shapes available.

Integrations

Eraser: Embed diagrams in Notion and Confluence; commit files and diagrams to GitHub.
Miro: Integrates with Atlassian, Google, Microsoft 365, and more.

UI/UX

Eraser: Drag-and-drop UI with diagram-as-code; harder to use than Miro.
Miro: Easy-to-use drag-and-drop UI.

Collaboration

Eraser: Real-time collaboration with commenting.
Miro: Real-time collaboration with commenting.

AI Features

Eraser: Generate and iterate using AI.
Miro: Generate, iterate, and edit text with AI; agents with “sidekicks.”

Enterprise

Eraser: Enterprise plan available.
Miro: Enterprise plan available.

Best for: Teams that are looking for a technical solution to design and docs with AI features. Ideal for teams looking for a tool that leans more on diagrams-as-code.

5. Excalidraw

Started in 2020
Similar to Draw.io
Pricing: Free (open source) and paid plans (starting at $7/mo)
Best for small teams to medium sized teams.

What is Excalidraw?

Excalidraw is a simple white boarding tool with real-time collaboration. It’s informal visual style makes it a popular choice for running meetings, brainstorms, or doing quick and dirty diagrams. The free plan requires no sign-up, and is a great option if you need to sketch something out quickly. The paid (Plus) plan offers similar functionality to Miro, at a slightly lower price point with less security and AI features.

Main features

🖌️ Simple and intuitive UI/UX
🫂 Real-time collaboration and commenting (on paid plan)
📚 Public library for shapes and templates
🤖 Gen AI capabilities (text to diagram, wireframe to code)

How does Excalidraw compare to Miro?

Pricing

Excalidraw: Free and paid plans for teams starting at $6/user/mo.
Miro: Free and paid plans for teams starting at $10/user/mo.

Shape and Templates Library

Excalidraw: Shapes and templates available from the public library.
Miro: Premium shape and template library on Business plan; custom shapes available.

Integrations

Excalidraw: Integrates with Notion and Obsidian.
Miro: Integrates with Atlassian, Google, Microsoft 365, and more.

UI/UX

Excalidraw: Easy-to-use drag-and-drop UI.
Miro: Easy-to-use drag-and-drop UI.

Collaboration

Excalidraw: Real-time collaboration with commenting.
Miro: Real-time collaboration with commenting.

AI Features

Excalidraw: Text-to-diagram and wireframe-to-code.
Miro: Generate, iterate, and edit text with AI; agents with “sidekicks.”

Enterprise

Excalidraw: None — no SSO.
Miro: Enterprise plan available.

Best for: Teams looking for quick ad-hoc architecture diagrams at an affordable price.

Choosing the right alternative

Ultimately, the right tool for you depends on your needs, budget, and team’s willingness to learn a new tool.

If you’re constrained by budget, Draw.io or Excalidraw offer free solutions.
If you want a tool that’s lightweight, easier to maintain, and structured, try IcePanel.
If your team’s already using Atlassian products, Draw.io is a good option.
If you want to automate and version your diagrams, try Eraser.

Any tools that we missed? Let us know!

Stay Chill 🤙