DEV Community: Infraforge

How we recovered tfstate after force-unlock raced a CI apply

Muhammad Hassaan Javed — Tue, 19 May 2026 22:37:05 +0000

The engineer pinged us at 4:48 pm on a Thursday. They had been trying to push a small IAM change to staging, terraform apply had failed with Error acquiring the state lock, and they did what most of us have done at least once: they ran terraform force-unlock with the ID from the error message and re-ran apply. The apply went through. Ten minutes later a teammate on a different branch ran terraform plan and the plan output wanted to destroy and recreate 38 resources that were sitting healthy in AWS, returning 200s, serving traffic. By the time we joined the bridge, the original engineer was halfway convinced they needed to let Terraform rebuild the whole staging environment. They did not. The cloud was fine. The state file was the thing that was broken.

Problem signals:

terraform plan shows -/+ destroy and recreate for resources nobody touched and that are healthy in the cloud
Teammates see Error: state snapshot was created by Terraform v1.5.7, which is newer than current v1.5.4
S3 bucket versioning shows two or three tfstate writes inside a 60 to 90 second window
The DynamoDB lock table is empty but the state file timestamps do not line up with anyone's apply log
Someone on the team ran terraform force-unlock in the last hour

A stale lock from a dead CI job

What the engineer thought it was

The first wrong model was reasonable. The engineer saw Error acquiring the state lock, looked at the lock ID, did not recognize it, and assumed it was a leftover from a CI job that had crashed earlier in the week. They had seen stale locks before. The fix last time was force-unlock. So they ran it again.

What they did not check was whether the lock holder was actually still alive. The CI job that held the lock was a scheduled terraform plan cycle running on a 15-minute cadence, and that particular run was on the slow side because the workspace had grown to about 600 resources. It was not stuck. It was just working. The force-unlock removed the lock entry from DynamoDB while the CI process was still very much holding an in-memory version of the state file, mid-refresh. Two writers, no coordination.

When the engineer's apply finished, it wrote its version of the state to S3. About forty seconds later, the CI run finished its refresh and wrote its version of the state to S3 on top of that. Two non-linear writes, each thinking it had the latest state, each clobbering parts of the other. S3 versioning preserved both, but the live state pointer was pointing at a Frankenstein.

Three S3 versions in 90 seconds, and a plan that wanted to destroy healthy infrastructure

The moment the real cause became visible

We pulled the S3 object versions for the state file first. That is the single most useful command in a Terraform state incident, and most teams do not run it until someone external suggests it.

aws s3api list-object-versions \
  --bucket acme-tfstate-staging \
  --prefix env/staging/terraform.tfstate \
  --query 'Versions[?LastModified>=`2024-01-18T16:45:00Z`].[VersionId,LastModified,Size]' \
  --output table

# Output (abridged):
# VersionId                          LastModified               Size
# 9f3aV2.JqL...                      2024-01-18T16:51:12Z       412847
# 8h2nB1.KpM...                      2024-01-18T16:50:31Z       408992
# 7g1mA0.LoN...                      2024-01-18T16:49:48Z       411203
# 6f0lZ9.MnO...                      2024-01-18T16:42:15Z       411198   <-- last known good

Three writes inside 84 seconds. The 16:42 version was the last clean write before the collision.

Three writes in 84 seconds was the smoking gun. A healthy workspace writes state once per apply, and the next write is usually hours away. Three writes that close together meant at least two processes had been racing. We cross-checked against the CI logs and the engineer's shell history and confirmed: the CI plan cycle had been refreshing state from 16:49:48 onwards, the engineer's force-unlock landed at 16:50:18, the engineer's apply wrote state at 16:50:31, and the CI refresh wrote its stale view back at 16:51:12. The 16:51 write was the one Terraform was now reading, and it had been built from a refresh that started before half the engineer's changes existed.
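
The same check can be scripted so it does not depend on someone eyeballing a table mid-incident. What follows is a minimal sketch, not part of the tooling described here: the function name flag_write_bursts and the 120-second default window are illustrative, it assumes GNU date, and it expects one "VersionId LastModified" pair per line on stdin.

```bash
# flag_write_bursts [WINDOW_SECONDS]
# Hypothetical helper: reads "VersionId LastModified" lines on stdin and prints
# each pair of consecutive writes that landed within WINDOW_SECONDS of each other.
# Assumes GNU date for ISO-8601 parsing.
#
# Example feed (assumed query shape):
#   aws s3api list-object-versions --bucket acme-tfstate-staging \
#     --prefix env/staging/terraform.tfstate \
#     --query 'Versions[].[VersionId,LastModified]' --output text | flag_write_bursts 120
flag_write_bursts() {
  local window="${1:-120}"
  # Sort oldest-first on the timestamp column so neighbours are consecutive writes.
  sort -k2,2 | {
    prev_id=""
    prev_epoch=""
    while read -r id ts; do
      [ -n "$ts" ] || continue            # skip blank or malformed lines
      epoch=$(date -u -d "$ts" +%s)
      if [ -n "$prev_epoch" ] && [ $(( epoch - prev_epoch )) -le "$window" ]; then
        echo "$prev_id -> $id: $(( epoch - prev_epoch ))s apart"
      fi
      prev_id="$id"
      prev_epoch="$epoch"
    done
  }
}
```

Fed the four versions from the listing above, it reports two pairs, 43 seconds and 41 seconds apart, and stays silent about the 16:42 write.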

That explained the plan output. The state Terraform was reading said the resources had attributes that did not match reality. Plan diffed state against the cloud, saw the mismatch, and proposed the only thing it knows how to propose: destroy and recreate. The cloud was correct. The state was lying. If we had let the apply run, we would have taken a healthy staging environment offline for somewhere between 40 minutes and two hours to rebuild things that did not need rebuilding.

Restore the pre-collision state version, then import only what actually drifted

How we worked through it

The recovery had two parts and an order that mattered. First, replace the corrupted live state with the last clean S3 version. Second, figure out which resources genuinely changed during the collision window and re-import only those. Skipping the second step is how teams end up with the same incident a week later, because real changes from the engineer's apply have been silently rolled back.

Before touching anything we pulled a local backup of the current (broken) state. If our restore went wrong, we wanted a way back.

# 1. Back up the current broken state to local disk
aws s3api get-object \
  --bucket acme-tfstate-staging \
  --key env/staging/terraform.tfstate \
  ./tfstate.broken.$(date +%s).json

# 2. Restore the last known good version in place
aws s3api copy-object \
  --bucket acme-tfstate-staging \
  --key env/staging/terraform.tfstate \
  --copy-source 'acme-tfstate-staging/env/staging/terraform.tfstate?versionId=6f0lZ9.MnO...' \
  --metadata-directive REPLACE

# 3. Confirm the active version is now the restored one
aws s3api head-object \
  --bucket acme-tfstate-staging \
  --key env/staging/terraform.tfstate \
  --query 'VersionId'

The copy-object call writes the old version as a new current version. Do not delete versions; you want the audit trail intact.

With the state restored, we ran terraform plan. The output was much shorter, around six resources, and they were the ones the engineer had actually changed in their apply. That was the divergence window: changes that had been made for real in AWS but that the restored state did not know about. Each of those needed a terraform import to reattach the live resource to the state. We did them one at a time, ran plan between each, and watched the diff shrink.

# Example: the engineer had created a new IAM role during their apply.
# The restored state predates it, but the role exists in AWS.

terraform import \
  module.platform.aws_iam_role.svc_runner \
  acme-staging-svc-runner

# After each import, re-run plan and confirm the resource is no longer in the diff.
terraform plan -out=/tmp/plan.out

# Repeat for each resource genuinely changed during the divergence window:
# - 1 IAM role
# - 1 IAM role policy attachment
# - 2 security group rules
# - 1 SSM parameter
# - 1 Lambda permission

Import surgically. Do not bulk-import; you want a clean plan after each step so you can spot collateral damage.

After the sixth import, terraform plan returned No changes. That was the success signal. The state matched the cloud, the engineer's intended changes were preserved, and nothing healthy had been destroyed. Total time on the bridge from first page to clean plan was 2 hours 40 minutes. About 45 minutes of that was the investigation; the rest was careful, slow imports with verification between each one.

flowchart TD
  A[terraform plan shows mass destroy/recreate] --> B{Are the resources actually broken in cloud?}
  B -- No, healthy --> C[State file is the problem, not cloud]
  B -- Yes, broken --> Z[Different incident; investigate cloud-side]
  C --> D[list-object-versions on tfstate]
  D --> E{Multiple writes in short window?}
  E -- Yes --> F[Identify last clean version pre-collision]
  E -- No --> Y[Investigate other corruption causes]
  F --> G[Backup current broken state locally]
  G --> H[copy-object to restore clean version]
  H --> I[terraform plan: short diff = divergence window]
  I --> J[terraform import each drifted resource]
  J --> K{Plan empty?}
  K -- No --> J
  K -- Yes --> L[Recovery complete; write postmortem]

Decision flow we use for any state-collision incident. The first branch matters most: confirm the cloud is healthy before touching state.

Diagram renders at the canonical version.

Two tempting shortcuts that would have made it worse

What we tried that we will not try again

Two shortcuts came up on the bridge that we ruled out. They are worth naming because both of them sound reasonable when you are tired.

1. Let terraform apply rebuild everything. The plan was already there; just type yes. This would have caused 30 to 90 minutes of staging downtime for resources that did not need rebuilding, broken any data-layer resources with state of their own, and lost the audit trail of what had actually changed.
2. Run terraform refresh to fix the state. Refresh updates state from the live infrastructure for known resources. It does not learn about resources the state has forgotten, and it cannot undo a structurally corrupted state. Refresh on a Frankenstein state can deepen the damage by writing the merged view back as the new truth.

We have written about the broader pattern in the Terraform state recovery playbook, specifically the rule we now apply on every state incident: the state file is the suspect until proven otherwise. Cloud is healthy until you have evidence it is not. That ordering keeps you from running destructive applies under time pressure.

A pre-apply lock check that prints the holder's age

What we changed afterwards

The team made two changes the week after the incident. Both are small. Both have already paid for themselves.

The first change is a pre-apply wrapper script that reads the DynamoDB lock table before terraform apply runs. If a lock exists, the script prints the lock holder, when the lock was acquired, and how long ago that was. If the lock is younger than the workspace's typical apply duration plus a safety margin, the script refuses to run and tells the engineer to wait. If the lock is genuinely old (older than any plausible live process), the script still does not force-unlock automatically; it prints the exact force-unlock command and makes the engineer paste it. The friction is the point.

#!/usr/bin/env bash
# pre-apply-lock-check.sh
set -euo pipefail

WORKSPACE="${1:?workspace name required}"
LOCK_TABLE="acme-tfstate-locks"
MAX_PLAUSIBLE_APPLY_SECONDS=1800  # 30 minutes

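# The lock item's LockID is "<bucket>/<key>". The "-md5" item next to it only
# stores the state digest and exists even when nobody holds the lock.
# No "|| echo {}" fallback: if the lookup itself fails, set -e stops the script
# instead of reporting "No lock".
LOCK_ITEM=$(aws dynamodb get-item \
  --table-name "$LOCK_TABLE" \
  --consistent-read \
  --key "{\"LockID\":{\"S\":\"acme-tfstate-staging/env/${WORKSPACE}/terraform.tfstate\"}}" \
  --output json)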

if [[ "$(echo "$LOCK_ITEM" | jq -r '.Item // empty')" == "" ]]; then
  echo "No lock. Safe to proceed."
  exit 0
fi

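# Info is a JSON document stored as a string attribute; parse it once.
INFO=$(echo "$LOCK_ITEM" | jq -r '.Item.Info.S')
HOLDER=$(echo "$INFO" | jq -r '.Who + " @ " + .Operation')
CREATED=$(echo "$INFO" | jq -r '.Created')
# date -d is GNU syntax; on macOS laptops install coreutils and use gdate.
AGE=$(( $(date +%s) - $(date -d "$CREATED" +%s) ))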

echo "Lock present."
echo "  Holder:  $HOLDER"
echo "  Created: $CREATED"
echo "  Age:     ${AGE}s"

if (( AGE < MAX_PLAUSIBLE_APPLY_SECONDS )); then
  echo
  echo "REFUSING TO PROCEED. Lock is younger than max plausible apply duration."
  echo "Wait for the current holder to finish, or confirm out-of-band that it is dead."
  exit 1
fi

echo
echo "Lock is older than ${MAX_PLAUSIBLE_APPLY_SECONDS}s. It may be stale."
echo "To force-unlock, run manually (do NOT automate this):"
echo "  terraform force-unlock $(echo "$LOCK_ITEM" | jq -r '.Item.Info.S' | jq -r '.ID')"
exit 2

We run this from CI and from a pre-apply git hook on engineer laptops. Same script, same rules, both places.

The second change is operational. The team's runbook now says: if you ever run force-unlock, page the on-call channel immediately with the lock ID and the reason. That single message would have caught this incident before it became one. The CI job would have replied within seconds that it was still running, and the engineer would have known to wait the eight minutes instead of clobbering the state.

We have stopped recommending that teams treat force-unlock as a routine command. It is a recovery command. It belongs in the same mental category as DROP TABLE: technically available, occasionally necessary, never the first thing you reach for. A Terraform state lock in DynamoDB does not expire on its own, which is why the wrapper's age threshold is generous on purpose. Wait for the holder to finish, or confirm the holder is dead. Those are the only two paths.

When the state file is the suspect and the clock is running

If you are looking at a destroy plan you do not trust

The hard part of state-collision incidents is not the recovery commands. The commands are mechanical once you know the shape of the problem. The hard part is the 20 minutes before that, when an apply plan is sitting in your terminal showing 30+ destroys, someone senior is asking on Slack whether you can just run it, and you have to decide whether the cloud is broken or the state is. Get that wrong under pressure and you cause the outage you were trying to prevent.

We run these recovery engagements every week. The force-unlock-collision pattern has shown up four times this quarter alone, in three different shapes: a CI plan racing an engineer apply (this one), two engineers applying simultaneously after a Slack misunderstanding, and a long-running import operation that an engineer killed because they thought it had hung. The recovery shape is the same. The diagnostic discipline of confirming the cloud is healthy before touching state is the same. The thing that changes is which version of state is the right one to restore to, and that takes practice to spot quickly.

If you are staring at a terraform plan that wants to destroy resources you know are healthy, do not run apply. Book an infrastructure review with our team and we will be on a bridge with you the same day to work through the state restore and the surgical imports. We have done this enough times that we can usually have you back to an empty plan inside three hours.

Originally published at https://infraforge.agency/insights/terraform-force-unlock-state-divergence-recovery/.

If your team is dealing with similar infrastructure debt, we offer infrastructure reviews and recovery engagements — see /review.

Why a forgotten RDS replica added $8,600 to one AWS bill

Muhammad Hassaan Javed — Tue, 19 May 2026 17:23:31 +0000

The finance lead forwarded the AWS bill on a Monday morning with three question marks in the subject line. The number had gone from a steady $3,200/month to $11,800 in six days. The on-call engineer's first guess, sensible enough, was that a data scientist had left a cross-region Athena job running over the weekend. It was not. It was an RDS read replica in a different AZ from its primary, provisioned a month earlier for a one-off load test, never decommissioned, retrying a replication-stream write every 50 milliseconds because somebody had flipped the primary's binlog format mid-stream. Nobody had read from the replica in three weeks. It had been quietly burning cross-AZ data transfer the whole time.

Problem signals:

AWS bill jumped 2-4x in under a week with no traffic or feature change
Cost Explorer concentrates the spike on DataTransfer-Regional-Bytes and RDSInstance line items
An RDS read replica sits in a different AZ than its primary and shows jagged ReplicaLag (spikes to 30s, drops to 0.5s, repeats)
No application config or BI tool actually points at the replica's endpoint
Recent schema or replication change on the primary that nobody coordinated with replica owners

Chasing the analytics query that did not exist

What we thought it was first

Almost every cost spike I have seen in the last three years gets blamed on analytics first. There is usually a junior data person, a notebook, a forgotten SELECT *, and a story everyone tells themselves. So we did the natural thing. We pulled the Athena query history for the previous ten days. Nothing unusual. We checked Redshift, which the team barely uses. Idle. We checked the data warehouse cluster's autoscaling history. Flat.

The clue was in Cost Explorer, but only when we grouped by usage type instead of by service. The RDS line item was up, sure, but the line item that had really moved was DataTransfer-Regional-Bytes. That is the meter for cross-AZ traffic inside a single region. Analytics queries do not typically light that meter up unless somebody has put a compute node in one AZ and the data in another, which would have been a much weirder problem.

Cross-AZ data transfer at that volume meant something was constantly shipping bytes between two availability zones. The shape of the bill said: find the thing that talks to itself across AZs at high frequency.

How we found the orphan replica

The diagnostic turn

We listed every RDS instance in the account and compared the AZ of each replica to its primary. One read replica was in us-east-1b while its primary was in us-east-1a. That alone is not a problem; cross-AZ replicas exist for legitimate HA reasons. What was odd was that this replica was tagged with nothing. No Owner. No Purpose. No Environment. Just the default Name tag, which read load-test-replica-temp.

# List replicas with their AZ and their primary's AZ
aws rds describe-db-instances \
  --query 'DBInstances[?ReadReplicaSourceDBInstanceIdentifier!=`null`].[DBInstanceIdentifier,AvailabilityZone,ReadReplicaSourceDBInstanceIdentifier,DBInstanceStatus]' \
  --output table

# Then for each primary, get its AZ
aws rds describe-db-instances \
  --db-instance-identifier <primary-id> \
  --query 'DBInstances[0].AvailabilityZone'

The two commands that surfaced the orphan in about 30 seconds.
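
The two lookups can be folded into one pass. The sketch below is an assumption-laden convenience, not the commands we ran: find_cross_az_replicas is a hypothetical name, it relies on --output text rendering a null source as None, and it only resolves same-region replicas, since a cross-region source appears as an ARN that will not match any local identifier.

```bash
# find_cross_az_replicas
# Hypothetical helper: prints "replica<TAB>replica_az<TAB>primary<TAB>primary_az"
# for every read replica whose AZ differs from its primary.
find_cross_az_replicas() {
  aws rds describe-db-instances \
    --query 'DBInstances[].[DBInstanceIdentifier,AvailabilityZone,ReadReplicaSourceDBInstanceIdentifier]' \
    --output text |
  awk -F '\t' '
    # Remember every instance AZ and its replication source (if any).
    { az[$1] = $2; src[$1] = $3 }
    END {
      for (id in src) {
        p = src[id]
        # "None" means the instance is not a replica.
        # Sources that are not in az[] (for example cross-region ARNs) are skipped.
        if (p != "None" && (p in az) && az[p] != az[id])
          printf "%s\t%s\t%s\t%s\n", id, az[id], p, az[p]
      }
    }' | sort
}
```

A cross-AZ replica is not a finding on its own. The output is a shortlist to check against tags and ReplicaLag.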

The replica's CloudWatch ReplicaLag metric was the giveaway that this was not a healthy idle replica. It would spike to 30 seconds, drop to 0.5 seconds, spike again, every minute or so. That sawtooth pattern means the replication thread is failing and retrying. We pulled the replica's error log and found the same line repeating roughly every 50 milliseconds: a binlog format mismatch. Someone had changed the primary from MIXED to ROW format three weeks earlier, and the replica had been retrying the broken stream ever since.

Every retry shipped a chunk of binlog across the AZ boundary. At 50ms intervals, 24 hours a day, for three weeks. That was the bill.

The five-minute check that prevents the worse outcome

What we did before deleting anything

The instinct, when you have found the thing burning money, is to kill it immediately. We did not. The worse outcome here is not 'replica costs another hour of cross-AZ transfer'. The worse outcome is 'replica gets deleted, a quarterly BI dashboard breaks on Friday, and finance is back in your inbox with a different question'.

So we did the cheap verification first. We grepped the application monorepo for the replica's endpoint hostname. Zero hits. We checked the BI tool's data sources (Metabase in this case). Nothing pointed at it. We checked the data team's Airflow DAGs. Clean. We checked Terraform state to see how it had been created. It was in a workspace tagged load-test that had not been touched in a month, and the engineer who created it had left the company three weeks earlier.

If something had pointed at it: the right move would have been to keep the replica, fix the binlog format, and decide whether the read pattern actually justified cross-AZ. Deletion would have caused a worse incident than the cost spike.
Nothing pointed at it: delete with --skip-final-snapshot. The replica was already corrupted by the binlog mismatch; a final snapshot was worthless. Cost stopped accruing within minutes.

aws rds delete-db-instance \
  --db-instance-identifier load-test-replica-temp \
  --skip-final-snapshot

The actual delete, once we were confident nothing depended on the replica.

Tag hygiene, expiration sweeps, and an anomaly budget that would have caught this on day 2

What we changed afterwards

Forgotten resources are the largest single category of cloud waste I see in client accounts. Bigger than oversized instances. Bigger than reserved-instance gaps. The fix is mechanical. Every cost-generating resource needs three tags: Owner, Purpose, ExpiresAt. ExpiresAt is the one most teams skip and the one that does the work.

We deployed a small Lambda on a weekly schedule that walks RDS, EC2, ELB, ElastiCache, and OpenSearch, finds resources past their ExpiresAt date or missing tags entirely, and posts to a Slack channel pinging the Owner. The owner has two weeks to either re-tag with a new ExpiresAt or delete. Resources with no Owner go to the platform team's queue. The first sweep flagged 47 resources across the account. Six of them were costing real money.

flowchart TD
  A[Weekly Lambda runs] --> B{Resource has<br/>Owner, Purpose,<br/>ExpiresAt tags?}
  B -- no --> C[Post to platform team queue]
  B -- yes --> D{ExpiresAt<br/>in past?}
  D -- no --> E[Skip]
  D -- yes --> F[DM the Owner in Slack]
  F --> G{Owner responds<br/>within 14 days?}
  G -- extends --> H[Update ExpiresAt]
  G -- no response --> I[Auto-tag for deletion<br/>review next sweep]

The sweep logic. About 180 lines of Python in practice.

Diagram renders at the canonical version.
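
The production sweep is Python, but the decision at the heart of the flowchart is small enough to sketch in shell, the language of the other snippets here. This is an illustration under stated assumptions: classify_resource is a hypothetical name, tag values are passed in as plain arguments, and ExpiresAt is assumed to be a YYYY-MM-DD date.

```bash
# classify_resource OWNER PURPOSE EXPIRES_AT [TODAY]
# Hypothetical helper mirroring the first two decisions in the flowchart.
# Prints one of: platform-queue, notify-owner, skip.
classify_resource() {
  owner="$1"; purpose="$2"; expires_at="$3"
  today="${4:-$(date -u +%Y-%m-%d)}"

  # Any missing tag sends the resource to the platform team's queue.
  if [ -z "$owner" ] || [ -z "$purpose" ] || [ -z "$expires_at" ]; then
    echo "platform-queue"
    return 0
  fi

  # Strip dashes so the dates compare as plain integers (20260401 < 20260519).
  exp_num=$(printf '%s' "$expires_at" | tr -d '-')
  today_num=$(printf '%s' "$today" | tr -d '-')

  if [ "$exp_num" -lt "$today_num" ]; then
    echo "notify-owner"   # ExpiresAt is in the past
  else
    echo "skip"           # still within its declared lifetime
  fi
}
```

Passing TODAY explicitly keeps the function deterministic. The 14-day follow-up in the flowchart needs stored state and is left out of this sketch.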

The second change was AWS Budgets with anomaly detection scoped per service. The team had a single account-wide budget set at $5,000/month, which is useless for catching this kind of incident because the spike was concentrated in one service and the account total only crossed $5,000 on day five. A per-service budget on RDS set at $4,000 with a 20% variance threshold would have fired on day 2. The alert that matters is the one that fires before you have spent the money, not after.

The third change was a process one. The original binlog format change had been an uncoordinated database tweak from a senior engineer who had not realized a replica existed. Schema and replication changes now require a checklist that includes 'list all replicas of this primary and confirm they support the new config' as a pre-flight step. It is not glamorous. It would have prevented the entire incident.

Where cost spike triage gets stuck

If your AWS bill just jumped and you do not know why

The hard part of a cost spike is not finding the resource. It is being confident enough to delete it. Most teams we work with have at least one orphan RDS, ElastiCache, or NAT gateway they are afraid to touch because nobody remembers what depends on it. The triage takes a day; the courage to act takes a week of meetings. By then the bill has run another $2,000.

We run cost spike triage engagements every month. We have seen the orphan-replica case four times this year, the NAT-gateway-in-the-wrong-AZ case more often than that, and a half dozen variants of 'load test that never got cleaned up' across CloudWatch Logs, OpenSearch, and Aurora Serverless. The pattern is almost always the same: a resource that nobody owns, a tag policy that was never enforced, and a budget alert tuned too coarse to catch concentration in a single service. We have written more on the underlying patterns in the cloud cost spikes problem brief and across our services.

If your AWS bill jumped this month and you cannot point at the resource with confidence, book an infrastructure review with our team and we will start with a 30-minute diagnostic call this week. Cost stops accruing the day we find the orphan.

Originally published at https://infraforge.agency/insights/forgotten-rds-replica-cross-az-cost-spike/.


Why terraform apply fails when plan passes: the map(any) trap

Muhammad Hassaan Javed — Tue, 19 May 2026 17:14:53 +0000

The on-call engineer pinged me at 4:42pm on a Friday with the release window open until 5:30. terraform apply against the staging workspace had failed with Error: Unsupported argument deep inside a child module nobody on the team had touched in seven months. terraform plan against the same workspace ran clean. They had already re-run plan twice and got fresh no-op output both times. The shape of the failure was off. plan and apply diverging is rare in the way they were describing, and you mostly see it on data sources that resolve at apply time, not on a static merge() call inside a module whose code had not changed in seven months.

Problem signals:

terraform plan succeeds locally but terraform apply fails on a specific environment
The error is Error: Unsupported argument or Inappropriate value deep inside a child module
The traceback points at a merge() or lookup() call inside a module that has not been edited in months
Your root module input list has crossed 20 variables and several are typed any or map(any)
There is no CI job that runs terraform plan against every environment on every PR

Three hypotheses, three dead ends, twenty-two minutes left in the release window

What we ruled out in the first 18 minutes

The first thing the on-call lead suggested was state drift: someone, somewhere, had run terraform import on a resource by hand. We checked the audit log. No import events in the past 30 days. We checked the lock table in DynamoDB. The lock had been released cleanly by the previous successful apply at 2:11pm.

The second hypothesis was provider version drift. The team had recently bumped hashicorp/aws from 5.62 to 5.71 in versions.tf. A breaking change in a resource schema can absolutely cause an Unsupported argument error if apply pulls a newer provider than plan resolved against. We pinned both runs to 5.71 explicitly, deleted .terraform/, re-ran init, then plan, then apply. Same error, same module, same line.

The third hypothesis was a stale workspace. The active workspace can differ from the one you think you selected when an engineer has exported TF_WORKSPACE, bypassing terraform workspace select, and forgotten about it. We ran terraform workspace show and verified it matched the intended target. The plan output even confirmed the right resource addresses.

Three explanations, three dead ends, twenty-eight minutes burned. The release window was now twenty-two minutes wide and shrinking. The on-call lead asked whether we should just roll back the deploy and figure it out Monday. I asked one more question first.

The 15th map(any) input that had been silently incubating for three weeks

Where the collision actually lived

I asked the on-call lead to walk me through what had merged into the workspace in the past three weeks. There were six commits. Five were obvious changes (image tags, a new IAM policy, a security group port). The sixth was a feature flag, added as a 15th map(any) input on the root module by an engineer who had joined six weeks earlier.

That was the lead.

The root module had 28 input variables. Fourteen of them were typed any or map(any) to absorb per-environment overrides accumulated over six years of feature additions. The new feature flag added a 15th map(any) input named feature_overrides. Its values flowed through a merge() chain down to the database child module, which did its own merge(local.legacy_db_flags, var.feature_overrides) inside modules/services/database/locals.tf.

The two maps had a key collision. Both contained a key named read_replica_routing. The new input's value was a string. The legacy local's value was a map(object({ host = string, weight = number })). merge() resolves collisions by taking the value from the last argument that defines the key, so the winner depended on whether the new input was non-empty, and the new feature flag was only non-empty in staging.

sequenceDiagram
  participant Op as Operator
  participant Plan as terraform plan
  participant Apply as terraform apply
  participant Child as child module
  Op->>Plan: feature_overrides (map(any))
  Plan->>Child: merge(map(any), map(any))
  Child-->>Plan: any (type-check deferred)
  Plan-->>Op: 0 to add, 0 to change (PASS)
  Op->>Apply: same input
  Apply->>Child: merge resolved to concrete value
  Child-->>Apply: Error: Unsupported argument
  Apply-->>Op: FAIL at 4:42pm

How map(any) defers type-checking past plan and surfaces it at apply

Diagram renders at the canonical version.

The collision had been latent for three weeks. plan succeeded because Terraform's planner walked the call graph with both maps' element types collapsed to any. The merged value passed type-check as any, which type-checks against anything. apply, which actually constructs the resource, evaluated the merged value against the receiving attribute's concrete type signature and discovered the value was a string where an object was required.

That is the part that hurts. Terraform's any type defers all type-checking until apply. Every map(any) input on a root module is a future apply-time failure waiting on a contributor who does not know the implicit shape.

Three options, one open release window, seven minutes to pick

What we did before running apply again

We had three options and one open release window. I walked the on-call lead through them on the bridge call.

1. Delete the legacy key. Fastest. Also the riskiest: the legacy read_replica_routing key was referenced by three modules-of-modules three layers down. Deleting it would have moved the failure from staging to production an hour later.
2. Rename the new key. Safe-feeling, but it left the underlying any-typed contract intact. Two months later a different contributor would add another map(any) input and we would be back on a Friday afternoon with the same shape of failure.
3. Rename plus add validation. Slower. We renamed the new key to feature_routing_overrides and added a validation block on the input that explicitly rejected the colliding shape at plan time going forward. That stopped the immediate recurrence.

Option three carried the day. The rename took seven minutes. The validation block took twelve. apply succeeded at 5:14pm with sixteen minutes to spare on the release window. The release shipped on time.

The audit work behind option one (the one we did NOT take) is what stuck with me. The next morning, we grepped the entire terraform/ tree for read_replica_routing to map every consumer. Seven references across four modules. Three in modules/services/database/locals.tf itself. One in modules/monitoring/cloudwatch.tf. One in modules/services/cache/lookups.tf, which read the value to construct its own routing decision and would have broken silently if we had deleted the legacy key the night before. The remaining two were in a state-recovery helper module the team had forgotten existed. We had nearly shot ourselves in the foot a second time.

We left a tombstone comment on the legacy key and an open PR that would, the following week, replace its map(any) type with a proper object({ ... }) schema. That work landed five days later. The downstream consumers caught the change at plan time, and three of them needed minor patches before the type tightening could merge. None of those patches would have caught the original collision. They all caught real existing bugs the any type had been hiding.

Two policy changes and one structural fix

What we changed afterwards

Two policy changes came out of that night, and one structural fix took longer.

The first policy: no new map(any) or any-typed inputs on root modules. The team's terraform/ directory has a pre-commit hook (8 lines of grep) that fails the commit if any new variable block contains type = any or type = map(any). Existing instances are grandfathered, with a TODO list tracked against each module. Three of the original 14 have been converted to typed objects so far. The hook has fired four times in the six weeks since.
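
For readers who want the shape of that hook, here is a minimal sketch. It is a reconstruction, not the team's actual eight lines: check_no_any_inputs is a hypothetical name, and the assumption is that the hook inspects only added lines in the staged diff, which is what lets existing any-typed inputs stay grandfathered.

```bash
# check_no_any_inputs
# Hypothetical helper: reads a unified diff on stdin and fails if any ADDED line
# declares `type = any` or `type = map(any)`.
check_no_any_inputs() {
  # First grep keeps added lines ("+...") but drops the "+++ b/file" headers.
  # Second grep matches the two banned type declarations, allowing a trailing comment.
  offending=$(grep -E '^\+[^+]' \
    | grep -E 'type[[:space:]]*=[[:space:]]*(any|map\(any\))[[:space:]]*(#.*)?$' \
    || true)
  if [ -n "$offending" ]; then
    echo "New any-typed inputs are not allowed:" >&2
    echo "$offending" >&2
    return 1
  fi
  return 0
}

# In .git/hooks/pre-commit (paths assumed):
#   git diff --cached -U0 -- 'terraform/*.tf' | check_no_any_inputs || exit 1
```

One known rough edge of a diff-based check: re-indenting a grandfathered type = any line makes it show up as added, so the hook will fire on it.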

The second policy: every PR runs terraform plan against every environment, not just the one the contributor cares about. A matrix job in CI runs plan -var-file=envs/<env>.tfvars across all four environments and fails the PR if any of them errors. This would not have caught the original collision (plan succeeded everywhere), but it catches a different class of failure where one environment's tfvars hits an unwritten code path.
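
The serial equivalent of that matrix job fits in a few lines. A sketch, assuming the envs/<env>.tfvars layout above; the environment names in the usage line are placeholders, and a real setup would also need to select the right backend or workspace per environment.

```shell
# Run `terraform plan` once per environment and report every failure,
# rather than stopping at the first one.
# Usage: plan_all_envs <env> [<env> ...]
plan_all_envs() {
  failed=""
  for env_name in "$@"; do
    if ! terraform plan -input=false \
        -var-file="envs/${env_name}.tfvars" >/dev/null; then
      failed="$failed $env_name"
    fi
  done
  if [ -n "$failed" ]; then
    echo "plan failed for:$failed" >&2
    return 1
  fi
}
```

Called as `plan_all_envs dev staging prod`, it returns non-zero if any environment fails to plan and names all of them, which is the behaviour a PR check needs.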

# Before: latent any-typed input
variable "feature_overrides" {
  type        = map(any)
  default     = {}
  description = "Per-environment feature flag overrides"
}

# In modules/services/database/locals.tf
locals {
  merged_flags = merge(
    local.legacy_db_flags,
    var.feature_overrides,
  )
}

# Above passes plan even when the two maps have a key
# whose value types disagree. The mismatch surfaces only
# at apply, when the receiving attribute is evaluated.

# After: typed, explicit, errors at plan time
variable "feature_overrides" {
  type = map(object({
    enabled     = bool
    rollout_pct = optional(number, 0)
    routing     = optional(string, "default")
  }))
  default     = {}
  description = "Per-environment feature flag overrides"

  validation {
    condition = alltrue([
      for k, v in var.feature_overrides :
      v.rollout_pct >= 0 && v.rollout_pct <= 100
    ])
    error_message = "rollout_pct must be between 0 and 100."
  }
}

The same variable, before and after. The lower form fails plan, not apply, when a contributor passes the wrong shape.

The structural fix took longer. A 28-input root module is not a configuration problem, it is a service-boundary problem. The team running the database stack should own a database/ root module with four inputs, not a 14-input subtree of a shared 28-input root. We split the original root into three roots along ownership boundaries (network, services, observability) using a thin terragrunt overlay for the cross-cutting variables. The split took six weeks of careful state-mv work to land without downtime. We have written more on the structural fix in the Terraform and IaC debt playbook, which covers when a shared root module starts costing more than the consistency it buys.

What we tell every team now: strong types in Terraform are not bureaucracy, they are the documentation. The half-day cost to write object({ name = string, enabled = bool, ... }) instead of map(any) buys you a plan-time failure instead of an apply-time failure, and apply-time failures land at 4:42pm on Fridays. We have stopped accepting map(any) inputs in any client engagement that involves an IaC audit, and we have not had a single contributor push back once they saw the cost.

If you are looking at a 28-input root with map(any) sprinkled through it

When your own root module is past 20 inputs

If you are reading this and your terraform/ directory has a root module past 20 inputs with several map(any) types in the input list, the failure you are heading toward is not a surprise. It is a scheduled event. The trigger will be a new contributor who does not know the implicit contract, plus one bad-enough Friday. The hardest part of cleaning it up is not the typing work itself; it is the audit of downstream consumers that have been silently depending on the loose contract for years. Two layers of modules-of-modules can hide a reference that breaks the moment you tighten the type, and your CI will not warn you because plan will keep passing right up to the apply that surfaces it.

We run these recovery and audit engagements every week. The map(any) collision pattern is the third-most-common shape we see in seed-to-Series-B SaaS Terraform repos, right after stale state lock holders and provider-version-drift cascades. It is one variant of the broader terraform apply fear problem we engage on most weeks. On a typical engagement we map every any-typed input in your root modules within the first day, prioritize them by blast radius, and either convert them in-place or split the root if the input count is the real problem. If you are looking at a Terraform root with map(any) sprinkled through it and a release window that does not forgive a 4pm apply failure, book an infrastructure review with our team and we will start with a 30-minute diagnostic call this week.

Originally published at https://infraforge.agency/insights/terraform-apply-fails-map-any-trap/.

If your team is dealing with similar infrastructure debt, we offer infrastructure reviews and recovery engagements — see /review.

Init container cascade when every kubectl patch reverts in 10 seconds

Muhammad Hassaan Javed — Fri, 15 May 2026 20:15:23 +0000

The Slack ping came in at 2:14 am. Two replicas of the fanout service were stuck in Init:1/3 and the deploy queue behind them had grown to seven changes. The on-call engineer had already tried the obvious move, kubectl edit deployment, and the changes had reverted within ten seconds. By the time we joined the bridge, they had patched the same field four times in twenty minutes and were starting to wonder if etcd was corrupted. The shape of the failure was wrong though. Init containers do not normally cascade across three different upstream dependencies at once; either something upstream was common, or the spec was being rewritten under us.

Problem signals:

Pods stuck in Init:0/3 or Init:1/3 with no forward progress and no clear log story
kubectl edit deployment changes revert within ten to fifteen seconds, every time
Three init containers each failing in a different protocol layer (TCP dial timeout, NXDOMAIN, AMQP ACCESS_REFUSED)
A topology or schema ConfigMap claims state that the live broker or database disagrees with
No activeDeadlineSeconds set on init containers, so transient failures wedge the Pod indefinitely

Two replicas wedged, seven changes queued, four failed patches

The 2 am page

When we joined the bridge, the on-call engineer had already burned forty minutes on what looked like a config drift bug. The fanout service in the platform namespace had two replicas, both stuck in Init:1/3. The init container chain had three steps (wait-for-redis, wait-for-mongodb, wait-for-rabbitmq) and the redis step was failing on a hardcoded IPv4 address that did not match the live Service. They patched the env var on the Deployment. The init container restarted. Ten seconds later the IP was back. They patched it again. Same thing.

Their working hypothesis was etcd corruption or a faulty kube-apiserver caching layer. We have seen both before, but neither matches the symptom shape here. Etcd corruption surfaces as 5xx responses to kubectl, not as silent successful PATCHes that revert. We needed to find what was doing the reverting before we wasted any more time on the symptoms.

Two wrong guesses before the real culprit became visible

What we thought it was first

The first guess was a GitOps controller with self-heal enabled. ArgoCD does this with syncPolicy.automated.selfHeal: true. Flux does this with its Kustomization controller. Both will revert a kubectl patch within seconds if the live spec drifts from the source of truth in git. We checked the cluster for both. No Argo Application referenced the fanout Deployment or its platform namespace. Flux was not installed at all.

The second guess was a mutating admission webhook. A custom webhook that rewrites init container specs at admission time could in theory produce this pattern, except admission webhooks fire on create and update, not on a ten-second timer. We ran kubectl get mutatingwebhookconfigurations and the output was empty. That ruled it out.

The reverting was not coming from a controller or webhook inside the cluster. That left the nodes. We SSHed to the node where one of the fanout pods was scheduled and went looking. Within two minutes we had it.

$ ssh node-01 'ps -ef | grep admission'
root  1842  ... /usr/bin/supervisord -c /etc/supervisor/conf.d/admission.conf
root  2104  ... /bin/bash /var/lib/apex/admission.sh

$ ssh node-01 'cat /etc/supervisor/conf.d/admission.conf'
[program:admission]
command=/var/lib/apex/admission.sh
autorestart=true
startsecs=5

A supervisord-managed script on the node was the reverter. autorestart=true meant killing it bought us at most a few seconds.

The stored ConfigMap was the source of truth, not the live Deployment

What was actually overwriting our patches

The script at /var/lib/apex/admission.sh ran every ten seconds. It read three fields (redis-host, mongodb-host, amqp-uri) from a ConfigMap called fanout-init-config and patched them straight into the init container env vars on the live Deployment. The ConfigMap was the source of truth. The Deployment was a downstream artifact. Patching the Deployment was about as durable as writing in pencil.

sequenceDiagram
  participant Engineer
  participant Deployment
  participant Admission as node script
  participant ConfigMap as fanout-init-config
  Engineer->>Deployment: kubectl edit (fix redis-host)
  Deployment-->>Engineer: spec updated
  Note over Admission: tick every 10s
  Admission->>ConfigMap: read fields
  ConfigMap-->>Admission: stale values
  Admission->>Deployment: patch init container env
  Deployment-->>Engineer: changes reverted

The reverting loop. Edit the ConfigMap, not the Deployment.

This pattern shows up in places where the original GitOps story had gaps and someone wrote a node-side enforcer as a stopgap. Then the team rotated, the wiki page got out of date, and the enforcer kept running. We have seen this exact shape three times in the last year. Twice with supervisord scripts. Once with a systemd timer. The fix is always the same: find the source of truth before patching anything, and if you cannot find it in under fifteen minutes, stop and look on the nodes.
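
One cluster-side clue worth checking early is the object's managedFields, which record which client last wrote to it and when. This is not something we ran that night; it is a sketch that assumes field management metadata is available on the cluster. The function name is ours.

```shell
# List the field managers that wrote to a Deployment, oldest first.
# Usage: last_writers <deployment> <namespace>
last_writers() {
  kubectl get deployment "$1" -n "$2" --show-managed-fields \
    -o jsonpath='{range .metadata.managedFields[*]}{.time}{"\t"}{.manager}{"\t"}{.operation}{"\n"}{end}' \
    | sort
}
```

If the last line of `last_writers fanout platform` shows a timestamp a few seconds old and a manager that is not you, something is still writing to the object. A script that shells out to kubectl will typically appear under a kubectl manager name rather than a controller name, which is itself a hint to look outside the usual controllers.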

What each failure actually told us, and the fourth fix that did not show in any log

Three init containers, three different protocols

Once we knew to edit the ConfigMap, we still had three concurrent faults to diagnose. Each init container was failing in a different layer of the network stack, and each one had its own diagnostic signature.

The redis init container was dialing 10.43.181.44 on port 6379 and getting i/o timeout after thirty seconds. We compared against the live Service and got back a different ClusterIP.

$ kubectl get svc redis -n platform -o jsonpath='{.spec.clusterIP}'
10.43.218.92

$ kubectl logs fanout-7d4b9c-xx -c wait-for-redis -n platform | tail -3
dial tcp 10.43.181.44:6379: i/o timeout
dial tcp 10.43.181.44:6379: i/o timeout
dial tcp 10.43.181.44:6379: i/o timeout

The hardcoded IP had no relationship to the live Service. ClusterIPs are not stable across Service recreation. Hardcoding one is a time bomb.
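
The comparison we did by hand can be turned into a small check. A sketch with illustrative function names: it flags a configured host that is a literal IPv4 address and compares it with the Service's current ClusterIP.

```shell
# Succeeds if the argument looks like a dotted-quad IPv4 literal.
is_ipv4_literal() {
  printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'
}

# Compare a configured host with the live Service ClusterIP.
# Usage: check_service_host <configured-host> <service> <namespace>
check_service_host() {
  is_ipv4_literal "$1" || return 0   # DNS names are resolved at dial time
  live=$(kubectl get svc "$2" -n "$3" -o jsonpath='{.spec.clusterIP}')
  if [ "$1" != "$live" ]; then
    echo "stale ClusterIP: configured $1, live $live; prefer $2.$3.svc" >&2
    return 1
  fi
  echo "warning: $1 matches today but is still hardcoded" >&2
}
```

`check_service_host 10.43.181.44 redis platform` would have failed immediately here. A matching IP only earns a warning, because the value goes stale again the next time the Service is recreated.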

The mongodb init container was logging 'lookup mongo.platform.svc.cluster.local on 10.43.0.10:53: no such host'. The live Service was named mongodb, not mongo. Two characters short, NXDOMAIN. We caught it by running kubectl get svc -n platform and reading the actual Service name out loud. The hostname in the ConfigMap had been typed from memory by someone who remembered the team's old naming convention.

The rabbitmq init container was the most interesting of the three. The TCP connection succeeded. The AMQP frame negotiation succeeded. Authentication succeeded. The vhost open returned ACCESS_REFUSED. The URI was amqp://app:app@rabbitmq:5672/fanout-internal. We port-forwarded to the management API and listed valid vhosts.

$ kubectl port-forward -n platform svc/rabbitmq 15672:15672 &
$ curl -s -u app:app http://localhost:15672/api/vhosts | jq -r '.[].name'
/
/platform

# fanout-internal does not exist on this broker

The URI parsed cleanly and authenticated cleanly. The failure was at vhost open. Always enumerate vhosts before blaming auth or credentials.
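
The vhost check is easy to script once the vhost is pulled out of the URI. A sketch with illustrative function names; it only decodes %2F, which is enough for vhost names that start with a slash, and does not handle other percent-encoded characters.

```shell
# Extract the vhost from an AMQP URI. No path at all means the default vhost "/".
amqp_vhost() {
  rest=${1#*://}                 # drop the scheme
  case "$rest" in
    */*)
      path=${rest#*/}            # everything after the first slash
      path=${path%%\?*}          # drop any query string
      printf '%s\n' "$path" | sed 's/%2[Ff]/\//g'
      ;;
    *) echo "/" ;;
  esac
}

# Succeeds if the URI's vhost appears in the list on stdin (one name per line).
vhost_exists() {
  grep -Fxq -- "$(amqp_vhost "$1")"
}
```

Piping the vhost names from the management API into `vhost_exists "amqp://app:app@rabbitmq:5672/fanout-internal"` fails against this broker, because fanout-internal is in neither line of the list.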

There was a fourth fix that did not show up in any log. None of the init containers had activeDeadlineSeconds set, and neither did the Pod spec. Even after the three protocol bugs were resolved, a transient DNS hiccup or broker restart would have hung an init container indefinitely instead of failing fast and letting the kubelet retry the Pod. We added activeDeadlineSeconds: 120 on every init container and 600 at the Pod level. Defense in depth, because init container deadlines do not always catch the case where the kubelet keeps reconciling a stuck container.
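
Independently of any deadline set in the spec, the wait command inside each init container can carry its own bound, so a dependency that never comes up produces a clear exit instead of an endless loop. A sketch; the function name is ours and the probe command is whatever the image provides.

```shell
# Retry a command until it succeeds or the deadline passes.
# Usage: wait_until <deadline-seconds> <interval-seconds> <command...>
wait_until() {
  deadline=$1
  interval=$2
  shift 2
  start=$(date +%s)
  until "$@"; do
    now=$(date +%s)
    if [ $((now - start)) -ge "$deadline" ]; then
      echo "gave up after ${deadline}s: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
}
```

With the 120-second figure from above, a redis wait might look like `wait_until 120 2 nc -z redis 6379`, assuming the image ships a netcat that supports -z. The non-zero exit fails the init container, and the kubelet's normal restart handling takes over from there.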

A second ConfigMap with the same shape, intentionally broken, was a load-bearing canary

The look-alike ConfigMap we almost broke

Before we patched fanout-init-config, we almost made one more mistake. There was a second ConfigMap in the same namespace called fanout-init-config-canary. Same shape, same broken-looking IP, same broken-looking AMQP URI. It was labeled role: protected and annotated with purpose: chaos-canary. A drift-detection job in the cluster read it every fifteen minutes to confirm its own detection logic still fired on broken inputs. If we had run a sed-style global replace across all matching ConfigMaps (which is exactly what a tired engineer at 3 am tends to do) we would have silenced the canary and the team would have learned about the next round of real drift only when a customer noticed.

When you patch infrastructure under pressure, target the named resource, not the pattern. Read the labels and annotations of every resource you are about to touch. A surprising number of clusters have load-bearing decoys you do not know about until you break them. We have written more on this in the Kubernetes and CI/CD stabilization pillar.
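
A guard for this takes a few lines and is cheap to run before any edit. A sketch that assumes the role: protected label convention described above; the function name is ours.

```shell
# Refuse to touch a ConfigMap that carries the label role=protected.
# Usage: guard_configmap <name> <namespace>
guard_configmap() {
  role=$(kubectl get configmap "$1" -n "$2" \
    -o jsonpath='{.metadata.labels.role}')
  if [ "$role" = "protected" ]; then
    echo "refusing to modify $2/$1: labeled role=protected" >&2
    return 1
  fi
}
```

Chained as `guard_configmap fanout-init-config platform && kubectl edit configmap fanout-init-config -n platform`, the edit only opens when the target is not protected. The same call against fanout-init-config-canary stops before anything is changed.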

Source-of-truth guard, deadline defense, a validation Job, and convergence checks

What we changed afterwards

The fanout service was the visible failure, but the recovery exposed four underlying gaps in the team's release flow. We left four durable changes in place before disconnecting from the bridge.

The fanout-init-config ConfigMap is now committed in git and synced via a real GitOps controller, and the node-side admission script was rewritten to refuse to overwrite a Deployment if the ConfigMap's content hash does not match a known-good baseline annotation. The script can still enforce, but it cannot enforce a broken state.
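
The core of that guard is a hash comparison. The sketch below is not the rewritten script, only the check at its centre, and it assumes sha256sum is available on the node; the function name is ours.

```shell
# Succeeds only if the SHA-256 of the data on stdin equals the expected hex digest.
# Usage: <serialized ConfigMap data> | matches_baseline <expected-sha256>
matches_baseline() {
  actual=$(sha256sum | cut -d' ' -f1)
  [ "$actual" = "$1" ]
}
```

The enforcer would serialize the ConfigMap's data in a stable way, read the expected digest from the baseline annotation, and skip the patch whenever `matches_baseline` fails. Any stable serialization works as long as the baseline was computed the same way.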

Every Deployment in the platform namespace now has activeDeadlineSeconds set at both the init container level (120 seconds) and the Pod level (600 seconds). The pair matters. Init container deadlines fail-fast the individual container; the Pod-level deadline prevents the kubelet from looping retries on a Pod that is structurally wrong.

A pre-deployment validation Job runs as part of the release flow. It carries label validation: predeploy, restartPolicy: OnFailure, activeDeadlineSeconds: 120, and a validator that checks two things: that the redis, mongodb, and rabbitmq Services each have non-empty Endpoints, and that the broker reports every binding the topology ConfigMap claims to have declared. Topology drift was the other half of this incident; the binding count had silently dropped from five to three after a partial migration three weeks earlier, and nobody had noticed because the topology-version annotation still said 5.

# Snippet from the topology-reconcile Job that fixed the broker drift
apiVersion: batch/v1
kind: Job
metadata:
  name: topology-reconcile-2026-05-15
  labels:
    validation: predeploy
spec:
  activeDeadlineSeconds: 120
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: reconcile
        image: rabbitmq:3.13-management
        command: ["/bin/bash", "-c"]
        args:
          - |
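            set -euo pipefail
            # Assumes yq, jq, and curl are present in the image and that
            # USER and PASS are injected from a Secret (not shown here).
            EXPECTED=$(yq '.bindings | length' /config/topology.yaml)
            # One compact JSON object per line, so values with spaces survive.
            yq -o=json '.bindings[]' /config/topology.yaml | jq -c . | while IFS= read -r b; do
              EX=$(printf '%s' "$b" | jq -r '.exchange')
              QU=$(printf '%s' "$b" | jq -r '.queue')
              # Quote the key: an unquoted .routing-key parses as a subtraction in jq.
              RK=$(printf '%s' "$b" | jq -r '.["routing-key"]')
              rabbitmqadmin --host=rabbitmq --username="$USER" --password="$PASS" \
                declare binding source="$EX" destination="$QU" routing_key="$RK"
            done
            # Counts every binding the API returns, so this is a lower-bound check.
            ACTUAL=$(curl -sf -u "$USER:$PASS" http://rabbitmq:15672/api/bindings | jq 'length')
            [ "$ACTUAL" -ge "$EXPECTED" ] || exit 1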

Reconcile via Job, not via kubectl exec. The Job is observable, retryable, and leaves an audit record.

The team's rollback runbook now requires two consecutive green health observations twenty seconds apart before a rollout is declared finished. Single-shot green is not enough on a cluster that has a ten-second admission tick, because you can catch the Pod between reverts and declare victory ninety seconds before the next failure cascade. We learned to distrust single-shot green the hard way on a different engagement, and that is now the default in every recovery handover we ship.
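
The runbook rule reduces to "N passing checks in a row, spaced apart, or it does not count". A sketch of that rule; the function name is ours and the health command is whatever the team already trusts.

```shell
# Succeed only after <count> consecutive passing checks, <interval> seconds apart.
# Any failing check aborts immediately with a non-zero status.
# Usage: converged <count> <interval-seconds> <command...>
converged() {
  need=$1
  gap=$2
  shift 2
  i=0
  while [ "$i" -lt "$need" ]; do
    "$@" || return 1
    i=$((i + 1))
    [ "$i" -lt "$need" ] && sleep "$gap"
  done
  return 0
}
```

For the rule above that would be something like `converged 2 20 kubectl rollout status deployment/fanout -n platform --timeout=5s`. A check that is green once and red twenty seconds later returns failure, which is exactly the case single-shot green misses.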

If you are looking at a cluster where every patch reverts within seconds, do not patch faster. Stop patching and find what is doing the reverting. The fix itself is usually ten minutes once you know where the source of truth lives. Finding the source of truth is what takes the hour. If you want a second pair of eyes on a system that is in this state, request an infrastructure review and we will be on a bridge with you the same day.

Originally published at https://infraforge.agency/insights/init-container-cascade-reverting-patches/.
