wealways.online

Landing page
GIF
simulating service going down
Example of an issue getting acknowledged and escalated (#unkown was the temp discussion channel)
Technical Diagram

OpenHeart MVP

~70% of production incidents take over an hour to resolve — not because engineers aren't smart, but because they're flying blind.

The on-call problem 📲

"i got paged at 3am and had no idea where to even start." — every on-call engineer, eventually.

most teams treat observability as an afterthought. metrics live in one place, logs in another, alerts fire into a void, and the runbook — if one exists — is a stale Google Doc nobody's touched since the last person who understood the system left.

the result: mean time to detect (MTTD) and mean time to resolve (MTTR) balloon. users notice before engineers do. incidents become firefights instead of investigations.

url shortener (but make it production-grade)

a URL shortener is the simplest possible web service — which makes it the perfect canvas for demonstrating what real production readiness looks like.

we built a URL shortener and then wrapped it in a complete incident response stack: structured logging, prometheus metrics, grafana dashboards wired to the 4 golden signals, alertmanager firing to discord, and a runbook that tells you exactly what to do when things break.

the chaos scripts let you break things on purpose. the dashboard tells you what broke and when. the runbook tells you how to fix it.

🚨 Track: incident response

🥉 bronze — stop using print statements

structured json logging with timestamps, log levels, and per-request context (method, path, status, duration, client ip)
/metrics endpoint exposing prometheus-format metrics: request counts, latency histograms, active in-flight requests, redirect cache hit/miss
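a minimal sketch of what the structured logging tier could look like, using only the python standard library. the field names in `REQUEST_FIELDS`, the `JsonFormatter` class, and the `build_logger` helper are assumptions made for illustration, not the project's actual logging code.

```python
import json
import logging
from datetime import datetime, timezone

# Hypothetical field names; the project's real log schema may differ.
REQUEST_FIELDS = ("method", "path", "status", "duration_ms", "client_ip")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Per-request context arrives via the `extra=` argument of the log call.
        for field in REQUEST_FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)


def build_logger(name: str = "shortener") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr (assumed setup)."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

a request handler would then call something like `logger.info("request handled", extra={"method": "GET", "path": "/abc123", "status": 302})`, and every line stays machine-parseable for whatever log tooling sits downstream.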

🥈 silver — wake up the on-call engineer

ServiceDown and HighErrorRate alert rules wired through alertmanager
discord notifications via a custom webhook relay — alerts fire within 5 minutes of failure:
- the on-call engineer must acknowledge the issue within 30 minutes, otherwise it is escalated to a higher-up
  - re-pings every 10 minutes until acknowledged
- once acknowledged, a temporary channel is opened for discussing the issue
- engineers can also manually escalate to ping higher-ups if the issue is outside their scope
bonus: HighLatencyP95 alert fires when p95 latency exceeds 2s for 2+ minutes
Alert Logic Configuration: link
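the acknowledgement and escalation rules above can be sketched as a small decision function. the 30-minute deadline and 10-minute re-ping interval come from the description; the `AlertState` type, the action names, and the order in which the rules are checked are assumptions, and the real webhook relay may be structured differently.

```python
from dataclasses import dataclass
from datetime import timedelta

ACK_DEADLINE = timedelta(minutes=30)     # escalate if unacknowledged this long
REPING_INTERVAL = timedelta(minutes=10)  # re-ping cadence until acknowledged


@dataclass(frozen=True)
class AlertState:
    """Hypothetical snapshot of one alert as the relay might track it."""
    since_fired: timedelta       # time since the alert first reached discord
    since_last_ping: timedelta   # time since the on-call engineer was last pinged
    acknowledged: bool
    manually_escalated: bool = False


def next_action(state: AlertState) -> str:
    """Return one of "escalate", "reping", "wait", or "none"."""
    if state.manually_escalated:
        # An engineer asked for help, so ping higher-ups regardless of timers.
        return "escalate"
    if state.acknowledged:
        # Someone owns the issue; no further automatic pings.
        return "none"
    if state.since_fired >= ACK_DEADLINE:
        return "escalate"
    if state.since_last_ping >= REPING_INTERVAL:
        return "reping"
    return "wait"
```

keeping the policy as a pure function of elapsed time makes it easy to unit test without a discord connection or a real clock.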

🥇 gold — total situational awareness

grafana dashboard covering all 4 golden signals (latency, traffic, errors, saturation) + infrastructure and postgresql internals rows
RUNBOOK.md with a playbook for every alert: what it means, step-by-step diagnosis, copy-paste fix commands, severity guide
chaos scripts (scripts/chaos.py) simulate high error rate, traffic spikes, and slow db — root cause diagnosable using only the dashboard and logs

🚀 Track: scalability

🥉 bronze — measure before you guess

k6 load test at 50 concurrent users
documented baseline p95 response time and error rate before any optimization
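for reference, here is one way the two baseline numbers could be derived from raw samples. this is a sketch using the nearest-rank percentile definition, which may differ slightly from how k6 computes its own p95, and it assumes only 5xx responses count as errors. `percentile` and `error_rate` are hypothetical helper names.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def error_rate(status_codes: Sequence[int]) -> float:
    """Fraction of responses that were server errors (assumed: status >= 500)."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if code >= 500)
    return errors / len(status_codes)
```

recording both numbers before any tuning gives later optimizations something concrete to be compared against.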

🥈 silver — one server isn't enough

2 gunicorn app containers behind nginx, spun up via docker compose
load test at 200 concurrent users, response times under 3 seconds
nginx handles load balancing across replicas with health checks

🥇 gold — the tsunami test

redis caching on short code lookups — cache hits skip the db entirely
pgbouncer connection pooling to prevent db saturation under load
load test at 500 concurrent users with <5% error rate
bottleneck report: before optimization, db connection exhaustion caused latency spikes at around 150 concurrent users. fixed by adding pgbouncer (pooled connections) and redis (which eliminates redundant db reads for repeat urls).
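the caching behaviour described here is the cache-aside pattern. the sketch below shows the idea with an in-memory stand-in so it runs without a redis server; `DictCache`, `resolve`, the `url:` key prefix, and the `load_from_db` callback are all hypothetical names, not the project's actual code.

```python
from typing import Callable, Dict, Optional


class DictCache:
    """In-memory stand-in exposing only get/set, the subset of a cache
    client this sketch needs."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


def resolve(
    short_code: str,
    cache: DictCache,
    load_from_db: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return the long URL for a short code, consulting the cache first."""
    key = f"url:{short_code}"
    cached = cache.get(key)
    if cached is not None:
        return cached  # cache hit: the database is never touched
    url = load_from_db(short_code)
    if url is not None:
        cache.set(key, url)  # populate the cache for subsequent lookups
    return url
```

with a real redis client the same shape should apply, though values come back as bytes unless the client is created with `decode_responses=True`, and a TTL on `set` is usually worth adding so stale mappings eventually expire.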

🛡️ Track: reliability

🥉 bronze — prove it works before you ship it

/health endpoint returns 200 OK — used by nginx and docker for liveness checks
14-test pytest suite covering unit and integration tests
github actions CI runs the full test suite on every push and pull request to main

🥈 silver — stop bad code from reaching production

coverage gate enforced at 50% via --cov-fail-under=50 — CI fails if coverage drops below threshold
integration tests hit the actual API: POST /shorten, url resolution, user flows, event logging
deploy job in CI has needs: [test] — a failing test blocks the deploy entirely
all 404s and 500s return clean JSON (never raw stack traces), with error handlers for both HTTPException and unexpected errors

🥇 gold — break it on purpose. watch it survive.

all containers configured with restart: unless-stopped or restart: always — kill a container, it comes back automatically
test_graceful_errors.py verifies bad inputs return structured JSON errors, not crashes
RUNBOOK.md documents exact failure modes for every alert: what broke, why, and how to fix it

📋 Runbook: link