What inspired you?
I've always heard the phrase "you can't fix what you can't see" thrown around in engineering, but it never really clicked until I was debugging an issue with no logs, no metrics, and no idea what was happening. This quest gave me the chance to build and explore the kind of observability stack I wished I had in those moments.
What did you learn?
This project taught me that observability is not a nice-to-have; it's a core part of running any production system. I learned the difference between logs and metrics, and when to reach for each. I also learned how alerting systems work end to end, from defining thresholds in Grafana to routing notifications to Discord, and why tuning alerts carefully matters for avoiding alert fatigue.
How did you build it?
I built the stack in three layers. First, I added structured JSON logging to the Flask app using python-json-logger, writing logs to a file that Promtail picks up and ships to Loki. Then I added a /metrics endpoint using prometheus-flask-exporter, which Prometheus scrapes every 15 seconds. Finally, I connected Grafana to both Loki and Prometheus as data sources, built a four-panel dashboard tracking request rate, error rate, latency, and memory usage, and configured two alert rules that fire to a Discord channel via webhook.
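The key idea in the logging layer is one JSON object per line, which Promtail can tail and ship to Loki. The project used python-json-logger; this is a minimal stdlib-only sketch of the same output shape (the logger name and fields are illustrative, not the project's actual ones):

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line,
    the shape a log shipper like Promtail expects to tail."""

    def format(self, record):
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        })


logger = logging.getLogger("app")
# In the real stack this would be a FileHandler pointed at the
# file Promtail tails; stderr keeps the sketch self-contained.
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("request handled")
```

With python-json-logger the formatter class is swapped for its `JsonFormatter`, but the handler wiring is the same.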
Challenges you faced?
The biggest challenge was securing the server. Early on, the droplet was compromised by a cryptominer within hours of being created because all ports were open to the internet. I had to destroy the droplet and start fresh with a proper firewall, a non-root deploy user, and Postgres bound to localhost only. Another challenge was getting alerts to actually fire: Grafana's NoData state was being treated as Normal by default, meaning that when the app was completely down the alerts wouldn't trigger. Changing the no-data behaviour to Alerting fixed this.
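The no-data fix maps to a single field in Grafana's alert-rule provisioning format. This fragment is a hedged sketch (the rule name, uid, and folder are hypothetical, not the project's actual values); the important line is `noDataState`:

```yaml
# Fragment of a Grafana alert-rule provisioning file (illustrative names)
apiVersion: 1
groups:
  - orgId: 1
    name: app-alerts
    folder: Observability
    interval: 1m
    rules:
      - uid: app-down            # hypothetical uid
        title: App is down
        condition: C
        noDataState: Alerting    # default NoData/Normal hides a fully dead app;
                                 # Alerting makes "no metrics at all" page you
        execErrState: Error
```

The same setting is available per rule in the Grafana UI under the alert rule's "Configure no data and error handling" section.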