Continuous Threat Modeling for Internet-Scale Backend Platforms

Luigi Iacuaniello — Sun, 11 Jan 2026 20:57:09 GMT

If you work on internet-scale backends, you know that threat modeling (when it exists at all) is often just a deliverable: it is accurate for a few days and stale for the rest of the system's lifecycle.

The thesis is simple: within an organization, threat modeling must become an engineering practice that is repeatable, connected to the backlog and to Architecture Decision Records, and that produces verifiable evidence such as tests, alerts and policies.

Why internet-scale is different

In modern systems the real topology changes faster than our ability to describe it: microservices, event-driven architectures, caching, cloud, dependencies.
The classic threat model fails not because it is wrong, but because it is slow.
Services are born, evolve and die quickly; feature flags alter the usual flows; autoscaling reacts to backpressure and queues; and new dependencies come into play immediately.
Asynchrony matters in these scenarios, and it forces two questions: who authorized what, and for how long is this operation valid?
The risk is no longer the single vulnerability but trust: who can perform which operations, and what happens when there is a breach or abuse at scale.
The dominant risks are systemic errors in trust boundaries, multi-tenancy, economic abuse/DoS, and compromise of the supply chain/control plane.

Therefore, the threat model must live where the change happens: in delivery.

Otherwise it is just literature for its own sake.

The real rule: start modeling trust boundaries, not components

The attack surface is almost always in the transition between services: delegated authentication, data transformations, retries, idempotency, or tenant propagation.

Each boundary is a point where assumptions change. Every change at a boundary (a new shared cache, a new provider, a new authentication mode) should trigger a new, smaller threat modeling cycle.

External client -> API Gateway / WAF
Gateway -> service-to-service / mesh
Service -> storage (DB, cache, object store)
Service -> event bus / queue
Runtime -> control plane (Kubernetes, IAM, CI/CD)
App -> external dependencies (payment, email, KYC, LLM provider)
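
One way to make "every boundary change triggers a small modeling cycle" mechanical is to keep the boundary list as data and diff it between releases. The sketch below is only an illustration: the `Boundary` fields, the service names and the `boundaries_needing_review` helper are all hypothetical, not part of any tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Boundary:
    """A declared trust boundary (hypothetical, minimal shape)."""
    source: str
    target: str
    auth: str  # how the caller is authenticated across this boundary


def boundaries_needing_review(before: set, after: set) -> set:
    """Return boundaries that are new or whose declared assumptions changed.

    Boundary is frozen (hashable), so a boundary whose auth mode changed
    compares as a different element and shows up in the difference.
    """
    return after - before


# Hypothetical example: a shared cache is added and one auth mode changes.
before = {
    Boundary("gateway", "orders-service", "mTLS"),
    Boundary("orders-service", "postgres", "password"),
}
after = {
    Boundary("gateway", "orders-service", "trusted-header"),
    Boundary("orders-service", "postgres", "password"),
    Boundary("orders-service", "redis-shared", "none"),
}

for b in sorted(boundaries_needing_review(before, after),
                key=lambda b: (b.source, b.target)):
    print(f"review needed: {b.source} -> {b.target} (auth: {b.auth})")
```

A check like this could run in CI against a boundary file kept next to the code, so that the review is prompted by the change itself rather than by a calendar.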

Scope

The first mistake is wanting to model everything; another classic mistake is producing no output (alignment meetings have to end at some point, right?).
The scopes that actually work are:

Single service: for targeted hardening or risky refactor;
Domain: to establish invariants (authZ, multi-tenant, logging);
End-to-end flow: to uncover trust boundaries and dependencies.

Define in-scope and out-of-scope without ambiguity.

The minimum acceptable outputs are not discussions. They are:

Backlog of controls (technical stories with acceptance criteria);
Test cases (positive and negative, including regressions);
Logging/audit and alerting requirements;
Incident response runbook with clear ownership.

It is important to know what to deliver, but above all, what to protect.

Asset inventory

This is where asset inventory comes into play: what are we protecting?

The critical asset here is no longer only the data but, above all, the privileged actions: creating a new resource, moving money, deploying code, changing a configuration.

That immediately brings the following into scope:

Identities and tokens (users, services, sessions, JWT/opaque, refresh);
Secrets and keys (KMS/Vault, signing key, mTLS, webhook secret);
Sensitive data (PII, finance, metadata, logs);
High-impact business functions (payments, provisioning, admin, export);
Control plane (CI/CD, Kubernetes, IAM, registry, DNS, secrets manager).

Many threat models fail on the last point: the infrastructure is not a trusted environment, so a compromised pipeline makes application-only protection irrelevant.

Security-oriented diagrams as a design review cycle

Diagrams are not just aesthetics: they carry semantics, and they help identify the points where the level of trust or the security context changes.

It is important to highlight:

Ingress/egress;
Real authN/authZ points;
Storage and caching (especially shared);
Asynchronous channels;
Data transformations.

You generally do not need many of them:

DFD for boundaries and data handling;
Sequence diagrams for authZ, replay, idempotency, race conditions;
A revised C4 model, with trust boundaries, egress and privileges highlighted.

STRIDE is only the beginning…

STRIDE is useful, but not enough. It does not account for operations, scale, and business. So how do we apply threat modeling, and which models do we apply?

In my experience, it is useful to apply threat modeling per trust boundary rather than per component, and to pick at most 5 top threats per boundary. The goal is not to reason about every possible threat, but about the priority ones that have:

High impact if they happen;
A real probability of occurring, considering the system in its current state.
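
The two criteria above can be turned into a simple ranking. This is a minimal sketch under stated assumptions: the 1 to 5 scales, the `impact * likelihood` score and the `Threat` and `top_threats` names are illustrative choices, not a standard scoring method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threat:
    boundary: str
    name: str
    impact: int      # 1 (low) .. 5 (severe), hypothetical scale
    likelihood: int  # 1 (rare) .. 5 (expected), given the system as it is today


def top_threats(threats, boundary, limit=5):
    """Rank one boundary's threats by impact x likelihood and keep the top few."""
    scoped = [t for t in threats if t.boundary == boundary]
    # sorted() is stable, so ties keep their original relative order.
    return sorted(scoped, key=lambda t: t.impact * t.likelihood, reverse=True)[:limit]


# Hypothetical entries for illustration only.
threats = [
    Threat("gateway->service", "identity from trusted headers", 5, 4),
    Threat("gateway->service", "verbose error leakage", 2, 3),
    Threat("service->cache", "cache not tenant-aware", 5, 3),
    Threat("gateway->service", "retry storm", 3, 4),
]

for t in top_threats(threats, "gateway->service", limit=2):
    print(t.name, t.impact * t.likelihood)
```

The cap matters more than the formula: a fixed limit per boundary forces the session to end with a short, actionable list.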

A quick example, an end-to-end multi-tenant flow:

Client -> Gateway -> Service -> DB/Redis -> Kafka -> Billing -> Payment Provider

Considering classic STRIDE, we would have:

Spoofing: JWT reused cross-tenant, service-to-service identity based on trusted headers, forged webhooks;
Tampering: Kafka events produced by unauthorized parties, tenantId altered, out-of-sequence updates;
Repudiation: disputes without an end-to-end correlatable audit trail;
Information Disclosure: cache not tenant-aware, enumeration, logs with token/PII;
DoS: retry storm, Kafka backlog, expensive endpoints that can be abused;
Elevation of Privilege: jump from user to admin, from app to control plane.
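
To make one of these items concrete, the "forged webhooks" case under Spoofing is commonly countered by verifying an HMAC signature over the raw request body. The sketch below uses only the Python standard library; the secret, the payload and the function names `sign` and `is_authentic` are assumptions for illustration, and real providers define their own header names and signing schemes (often including a timestamp to limit replay, which is not shown here).

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Compute a hex HMAC-SHA256 signature over the raw body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def is_authentic(secret: bytes, body: bytes, received_signature: str) -> bool:
    """Check the received signature using a constant-time comparison."""
    expected = sign(secret, body)
    # Compare as bytes so non-ASCII input cannot raise inside compare_digest.
    return hmac.compare_digest(expected.encode("utf-8"),
                               received_signature.encode("utf-8"))


# Hypothetical shared secret and payload.
secret = b"example-webhook-secret"
body = b'{"event": "payment.succeeded", "tenant": "t-1"}'
signature = sign(secret, body)

print(is_authentic(secret, body, signature))
print(is_authentic(secret, body + b" ", signature))
```

The signature must be computed over the bytes exactly as received; re-serializing the parsed JSON before verifying is a frequent source of false rejections.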

Integrating other models is necessary:

LINDDUN: great when dealing with PII, tracking, correlations and inference;
PASTA (risk-centric process): links threats to business impact and to tests/countermeasures, useful for critical platforms and domains;
Attack trees / abuse cases: particularly effective for business logic abuse and for scenarios about how the attacker monetizes;
MITRE ATT&CK: helps model intrusions, lateral movement and persistence (especially in the control plane);
Kill chain / intrusion lifecycle: to reason about prevention, detection and response.

Multi-tenancy: the invariant is serious

In B2B multi-tenant models, the invariant is that no operation can read or write outside the correct tenant.
That sentence is a testable property that can be modeled explicitly. Two things matter in particular: how the tenant is resolved (from token/claims, path, mapping) and how strongly it is bound to the identity. Then choose the isolation model (DB-per-tenant, schema-per-tenant, row-level, or application enforcement) together with its failure modes.
You must include cache/queue (where leakage is frequent) and analytics/reporting (where joins and aggregations can accidentally mix tenants).
A flaw here is not a bug. It is a systemic breach.
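
Because the invariant is testable, it can be written as negative tests against whatever layer enforces it. The sketch below assumes application-level enforcement and uses an in-memory `InvoiceStore` as a stand-in for a real repository; the class, its methods and the tenant identifiers are all hypothetical.

```python
class TenantIsolationError(Exception):
    """Raised when a write would cross a tenant boundary."""


class InvoiceStore:
    """In-memory stand-in for a tenant-scoped repository (hypothetical)."""

    def __init__(self):
        self._rows = {}  # invoice_id -> (tenant_id, amount)

    def put(self, tenant_id, invoice_id, amount):
        existing = self._rows.get(invoice_id)
        if existing is not None and existing[0] != tenant_id:
            raise TenantIsolationError("write outside the caller's tenant")
        self._rows[invoice_id] = (tenant_id, amount)

    def get(self, tenant_id, invoice_id):
        row = self._rows.get(invoice_id)
        # Same answer for "missing" and "belongs to another tenant",
        # so the response cannot be used to enumerate foreign IDs.
        if row is None or row[0] != tenant_id:
            return None
        return row[1]


store = InvoiceStore()
store.put("tenant-a", "inv-1", 100)

print(store.get("tenant-a", "inv-1"))  # own tenant: value is returned
print(store.get("tenant-b", "inv-1"))  # other tenant: indistinguishable from missing
```

The tenant identifier is a mandatory argument on every call rather than ambient state, which makes a missing tenant check a visible error instead of a silent default.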

Abuse-driven modeling

In systems at this scale, threat modeling must include the economics of the attack.

Typically: brute force/credential stuffing, enumeration, scraping, business logic abuse, and DoS (saturation is expensive).

Controls are often resilience-oriented:

Multi-level rate limiting (IP, account, tenant, endpoint class);
Quotas and budgets per tenant with hard enforcement;
Aggressive timeouts, backpressure and circuit breaker;
Async jobs for expensive operations;
Anti-automation protections (device fingerprinting, step-up auth, proof-of-work where it makes sense).
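
As an illustration of multi-level rate limiting, a token bucket can be keyed by tenant and endpoint class so that one noisy tenant cannot exhaust another's budget. This is a single-process sketch under stated assumptions: `TokenBucket`, `RateLimiter` and the example limits are hypothetical, and a real deployment would usually keep this state in a shared store.

```python
import time


class TokenBucket:
    """Classic token bucket: refills continuously, spends one cost per request."""

    def __init__(self, capacity, refill_per_second, now):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity
        self.updated_at = now

    def allow(self, now, cost=1.0):
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class RateLimiter:
    """One bucket per (tenant, endpoint class); the clock is injectable for tests."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self._capacity = capacity
        self._refill = refill_per_second
        self._clock = clock
        self._buckets = {}

    def allow(self, tenant_id, endpoint_class, cost=1.0):
        now = self._clock()
        key = (tenant_id, endpoint_class)
        bucket = self._buckets.get(key)
        if bucket is None:
            bucket = self._buckets[key] = TokenBucket(self._capacity, self._refill, now)
        return bucket.allow(now, cost)


limiter = RateLimiter(capacity=2, refill_per_second=1.0)
print(limiter.allow("tenant-a", "export"))
```

The `cost` parameter lets expensive endpoint classes consume more of the budget per call, which ties the limit to the economics of the request rather than to its count alone.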

And for the control plane and the supply chain?

Modern attacks often bypass the application entirely; a serious threat model is expected to cover:

CI/CD: secrets, permissions, approvals;
Kubernetes: least RBAC, admission policy, network policy;
Registries and dependencies: base images, updates, provenance.

Here the MITRE ATT&CK framework mentioned above becomes concrete: it describes realistic techniques and helps define detection and incident response.

Not modeling this means assuming that everything can be prevented. In cybersecurity that assumption is false.

Formalize everything, especially key decisions

All architectural decisions that impact security must be written in a Security Architecture Decision Record.
Otherwise, in a context like this, the system drifts as soon as a new team “optimizes” something. Typical decisions to record:

JWT vs opaque token (and revocation strategy, more complex for JWTs);
Tenant isolation (app-only, DB RLS, double enforcement);
Cache strategy (tenant-aware keying, invalidation, TTL, stampede protection);
Egress policy and allow-listing;
Retry, idempotency and event deduplication.
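
The last item is easy to state in an ADR and easy to get wrong in code, so a small sketch may help. It assumes at-least-once delivery and deduplicates on `(tenant_id, event_id)`; the `IdempotentConsumer` class and the in-memory set are hypothetical stand-ins for a durable store with a retention window.

```python
class IdempotentConsumer:
    """Skips events that were already processed for the same tenant."""

    def __init__(self, handler):
        self._handler = handler
        # Assumption: a real system would use a durable store with a TTL.
        self._seen = set()

    def handle(self, tenant_id, event_id, payload):
        """Return True if the event was processed, False if it was a duplicate."""
        key = (tenant_id, event_id)  # event IDs are scoped per tenant
        if key in self._seen:
            return False
        self._handler(tenant_id, payload)
        # Recorded only after the handler succeeded, so a failed attempt
        # can be retried; the handler should still tolerate a rare re-run.
        self._seen.add(key)
        return True


charged = []
consumer = IdempotentConsumer(
    lambda tenant_id, payload: charged.append((tenant_id, payload["amount"]))
)

consumer.handle("t1", "evt-1", {"amount": 10})
consumer.handle("t1", "evt-1", {"amount": 10})  # redelivery: ignored
consumer.handle("t2", "evt-1", {"amount": 5})   # same ID, different tenant
print(charged)
```

Including the tenant in the deduplication key is the design choice worth recording: without it, one tenant's event ID could suppress another tenant's legitimate event.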

The development team owns the threat models; the security team provides the framework and the review.

Do not incentivize bureaucracy…

With every new boundary introduced, a scalable loop in the delivery process is useful: micro-sessions of even a few minutes, involving a small number of people, that update the diagrams and produce immediate outputs.

Each top threat identified should produce at least:

An implementable control (story);
A test/verification (automatic if possible);
Storable evidence (ADR, config, dashboard, runbook).

And how do we know whether we are going in the right direction? With a few Key Performance Indicators (KPIs):

% of changes that modify trust boundaries accompanied by a security ADR;
authZ/multi-tenant tests introduced;
time between new dependency and associated policy;
recurring incidents for the same root cause;
control plane drift.
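
The first indicator is straightforward to compute once changes are labeled. The sketch below assumes each change record carries two flags; the `Change` fields and the `adr_coverage` function are hypothetical, and where the labels come from (PR templates, path rules, manual tagging) is left open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    change_id: str
    touches_trust_boundary: bool
    has_security_adr: bool


def adr_coverage(changes):
    """Percentage of boundary-touching changes that carry a security ADR.

    Returns None when no change touched a boundary, to avoid reporting
    a misleading 0% or 100% for an empty period.
    """
    relevant = [c for c in changes if c.touches_trust_boundary]
    if not relevant:
        return None
    covered = sum(1 for c in relevant if c.has_security_adr)
    return 100.0 * covered / len(relevant)


# Hypothetical sample of merged changes.
changes = [
    Change("c-101", True, True),
    Change("c-102", True, False),
    Change("c-103", False, False),
    Change("c-104", True, True),
    Change("c-105", True, True),
]

print(adr_coverage(changes))
```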

Conclusion

In this context, performance and security are the same thing: a cache or a retry can introduce new attack surfaces.
Today’s threat modeling is driven by trust boundaries and produces code, tests and, above all, decisions.
Documentation files age very quickly.

Stories by Luigi Iacuaniello on Medium